12th International Conference on Computer and Knowledge Engineering
Introducing E4MT and LMBNC: Persian pre-processing utilities
Authors:
Zakieh Shakeri (1), Mehran Ziabary (2), Behrooz Vedadian (3), Fatemeh Azadi (4), Saeed Torabzadeh (5), Arian Atefi (6)
1- Targoman Intelligent Processing Co. Pjc.
2- Targoman Intelligent Processing Co. Pjc.
3- Targoman Intelligent Processing Co. Pjc.
4- Targoman Intelligent Processing Co. Pjc.
5- Amirkabir University of Technology
6- Targoman Intelligent Processing Co. Pjc.
Keywords:
Natural language processing, Neural machine translation, Persian text pre-processing
Abstract:
In this paper, we introduce two utilities that are used extensively in our services and products: a Persian pre-processor (E4MT), which we use for both training and inference in our machine translation services, and a corpus-level, language-model-based error corrector (LMBNC), which we apply to corpora before training. E4MT (Essential tools for MT) consists of character normalization, spell correction, entity tagging, and tokenization/detokenization modules. It addresses the large vocabulary size of Persian by reducing the vocabulary by a factor of approximately 2. We show that applying E4MT to the English-Persian translation task yields an improvement of at least 1.2 BLEU over other toolkits. LMBNC, which we apply to the training corpora, uses a domain-specific language model to identify context-dependent misspellings. The results show that using the corrected training corpora improves English-Persian translation quality by 0.6 BLEU over the baseline. Additionally, manual evaluation shows 97.9% precision for E4MT and 98.1% precision for LMBNC.
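To illustrate why character normalization can shrink a Persian vocabulary, the following is a minimal sketch. It is not E4MT's code or rule set: the function names `normalize_chars` and `vocabulary_size` and the two-entry mapping are assumptions chosen for illustration. It relies only on the well-known fact that Persian text often mixes Arabic and Persian code points for visually similar letters, so the same word can appear under several spellings.

```python
# Hypothetical illustration only; this is NOT E4MT's implementation or rule set.
# Maps two Arabic code points to the Persian code points commonly used instead.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_chars(text: str) -> str:
    """Replace Arabic yeh/kaf with their Persian counterparts."""
    return text.translate(ARABIC_TO_PERSIAN)


def vocabulary_size(sentences) -> int:
    """Count distinct whitespace-separated tokens across all sentences."""
    return len({token for sentence in sentences for token in sentence.split()})


# The same word ("book") written once with Arabic kaf and once with Persian kaf.
corpus = ["\u0643\u062A\u0627\u0628", "\u06A9\u062A\u0627\u0628"]
normalized = [normalize_chars(sentence) for sentence in corpus]

before = vocabulary_size(corpus)
after = vocabulary_size(normalized)
```

In this toy corpus the two spellings collapse into one vocabulary entry after normalization. A real pre-processor would cover many more characters and would combine normalization with the other modules named in the abstract.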