12th International Conference on Computer and Knowledge Engineering
Introducing E4MT and LMBNC: Persian pre-processing utilities
Authors :
Zakieh Shakeri (1), Mehran Ziabary (2), Behrooz Vedadian (3), Fatemeh Azadi (4), Saeed Torabzadeh (5), Arian Atefi (6)
1- Targoman Intelligent Processing Co. Pjc.
2- Targoman Intelligent Processing Co. Pjc.
3- Targoman Intelligent Processing Co. Pjc.
4- Targoman Intelligent Processing Co. Pjc.
5- Amirkabir University of Technology
6- Targoman Intelligent Processing Co. Pjc.
Keywords :
Natural language processing, Neural machine translation, Persian text pre-processing
Abstract :
In this paper, we introduce two utilities that are used extensively in our services and products: a Persian pre-processor (E4MT), which we use for both training and inference in our machine translation services, and a corpus-level, language-model-based error corrector (LMBNC), which we apply to corpora before training. E4MT (Essential tools for MT) consists of character normalization, spell correction, entity tagging, and tokenization/detokenization modules. It addresses the large vocabulary size of Persian by reducing the vocabulary size by a factor of approximately 2. We show that applying E4MT to the English-Persian translation task yields an improvement of at least 1.2 BLEU over other toolkits. LMBNC, which we apply to the training corpora, uses a domain-specific language model to identify context-dependent misspellings. The results show that training on the corrected corpora improves English-Persian translation quality by 0.6 BLEU over the baseline. Additionally, manual evaluation shows 97.9% precision for E4MT and 98.1% precision for LMBNC.
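As an illustration of the character normalization step named in the abstract, the following is a minimal sketch, not E4MT's actual implementation. It assumes a small, hypothetical mapping (Arabic yeh and kaf to their Persian forms, Arabic-Indic digits to Persian digits); the function name `normalize_persian` and the mapping table are assumptions made for this example. It shows how normalization can collapse spelling variants of one word into a single vocabulary entry.

```python
import unicodedata

# Hypothetical mapping for illustration only; E4MT's real table is not
# described in the abstract and is likely more extensive.
_CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    # Arabic-Indic digits (U+0660..U+0669) -> Extended Arabic-Indic (Persian) digits
    **{chr(0x0660 + i): chr(0x06F0 + i) for i in range(10)},
})


def normalize_persian(text: str) -> str:
    """Apply Unicode NFC composition, then map Arabic code points to Persian ones."""
    return unicodedata.normalize("NFC", text).translate(_CHAR_MAP)


# Two encodings of the same word ("book"), differing only in the kaf code point.
variants = ["\u0643\u062A\u0627\u0628", "\u06A9\u062A\u0627\u0628"]
vocabulary = {normalize_persian(w) for w in variants}
```

After normalization, both variants map to the same string, so `vocabulary` holds one entry instead of two. This is only one of the ways a pre-processor might shrink a vocabulary; the abstract does not state how much of E4MT's reduction comes from this step.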