0% Complete
Home
/
12th International Conference on Computer and Knowledge Engineering
Introducing E4MT and LMBNC: Persian pre-processing utilities
Authors :
Zakieh Shakeri
1
Mehran Ziabary
2
Behrooz Vedadian
3
Fatemeh Azadi
4
Saeed Torabzadeh
5
Arian Atefi
6
1- Targoman Intelligent Processing Co. Pjc.
2- Targoman Intelligent Processing Co. Pjc.
3- Targoman Intelligent Processing Co. Pjc.
4- Targoman Intelligent Processing Co. Pjc.
5- Amirkabir university of technology
6- Targoman Intelligent Processing Co. Pjc.
Keywords :
Natural language processing،Neural Machine Translation،Persian text pre-processing
Abstract :
In this paper, we introduce two utilities, extensively used in our services and products. A Persian pre-processor(E4MT) we use for both training and inference in our machine translation services and a corpora-level language model-based error corrector(LMBNC), which we apply to corpora before training. E4MT(Essential tools for MT) consists of character normalization, spell correction, entity tagging, and tokenization/detokenization modules. It handles the Persian large vocabulary size problem by approximately reducing the vocabulary size by a factor of 2. We show that applying E4MT on the English-Persian translation task, yields an improvement of at least 1.2 BLEU over other toolkits. We apply LMBNC on the training corpora, which uses a domain-specific language model to identify context-dependent misspellings. The results show, using this corrected training corpora improves the English-Persian translation quality by 0.6 BLEU over its baseline. Additionally, the manual evaluation shows 97.9\% precision for E4MT and 98.1\% precision for LMBNC.
Papers List
List of archived papers
LLM-Driven AutoML for Cross-Lingual Handwritten OCR: Closed-Loop Neural Architecture Search with GPT-5, GPT-4o, and Claude Sonnet 4
Mobina Kashaniyan - Amirhossein Ghassemi - Nasser Mozayani
Smart Home Connectivity: Identifying the Best IoT Application Layer Protocols
Hossein Shahinzadeh - Zohreh Azani - Sundus F. Al-Hameedawi - S. Mohammadali Zanjani - Saiedeh Mehrabani-Najafabadi - Mohammadreza Hemmati
Ramp Progressive Secret Image Sharing using Ensemble of Simple Methods
Atieh Mokhtari - Mohammad Taheri
Practical Implementation of Real-Time Waste Detection and Recycling based on Deep Learning for Delta Parallel Robot
Hasan Jalali - Shaya Garjani - Ahmad Kalhor - Mehdi Tale Masouleh - Parisa Yousefi
A Weighted TF-IDF-based Approach for Authorship Attribution
Ali Abedzadeh - Reza Ramezani - Afsaneh Fatemi
Robustness Scan of Digital Circuits Using Convolutional Neural Networks
Mobin Vaziri - Mohammad Mehdi Rahimifar - Hadi Jahanirad
Islamic Geometric algorithms: A survey
Elham Akbari - Azam Bastanfard
Dual-Mode Density-Aware Attention-based Hierarchical Graph Pooling
Roya Booryaee - Parsa Haddadian - Ali Kamandi
Underwater Image Super-Resolution using Generative Adversarial Network-based Model
Alireza Aghelan - Modjtaba Rouhani
Depression Diagnosis Using Optimization of Nonlinear EEG Features Based on Parametric Learning Tactics
Ali Asadi Zeidabadi - Melika Changizi - Mahdi Zolfagharzadeh Kermani - Sara Bargi Barkouk
more
Samin Hamayesh - Version 43.7.0