12th International Conference on Computer and Knowledge Engineering
Introducing E4MT and LMBNC: Persian pre-processing utilities
Authors :
Zakieh Shakeri (1), Mehran Ziabary (2), Behrooz Vedadian (3), Fatemeh Azadi (4), Saeed Torabzadeh (5), Arian Atefi (6)
1- Targoman Intelligent Processing Co. Pjc.
2- Targoman Intelligent Processing Co. Pjc.
3- Targoman Intelligent Processing Co. Pjc.
4- Targoman Intelligent Processing Co. Pjc.
5- Amirkabir University of Technology
6- Targoman Intelligent Processing Co. Pjc.
Keywords :
Natural language processing, Neural Machine Translation, Persian text pre-processing
Abstract :
In this paper, we introduce two utilities that are used extensively in our services and products: a Persian pre-processor (E4MT), which we use for both training and inference in our machine translation services, and a corpus-level, language model-based error corrector (LMBNC), which we apply to corpora before training. E4MT (Essential tools for MT) consists of character normalization, spell correction, entity tagging, and tokenization/detokenization modules. It addresses the large vocabulary size of Persian by reducing the vocabulary size by approximately a factor of 2. We show that applying E4MT to the English-Persian translation task yields an improvement of at least 1.2 BLEU over other toolkits. LMBNC, which we apply to the training corpora, uses a domain-specific language model to identify context-dependent misspellings. The results show that training on the corrected corpora improves English-Persian translation quality by 0.6 BLEU over the baseline. Additionally, manual evaluation shows 97.9% precision for E4MT and 98.1% precision for LMBNC.
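To illustrate why character normalization and entity tagging can shrink a Persian vocabulary, the following is a minimal sketch and not the E4MT implementation. The mapping tables, the `<NUM>` placeholder, and the function names `normalize`, `tag_numbers`, and `preprocess` are all assumptions made for this example; the abstract does not specify E4MT's rules or interface.

```python
import re

# Hypothetical mapping: Arabic code points that often appear in Persian text,
# mapped to their Persian counterparts (an assumption, not E4MT's actual table).
CHAR_MAP = {
    0x064A: "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    0x0643: "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9)
# digits, both mapped to ASCII digits.
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x06F0 + i: str(i) for i in range(10)})

TRANSLATION_TABLE = {**CHAR_MAP, **DIGIT_MAP}

# Integers or simple decimals written with ASCII digits.
NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def normalize(text: str) -> str:
    """Map variant characters and digits to one canonical form."""
    return text.translate(TRANSLATION_TABLE)


def tag_numbers(text: str) -> str:
    """Replace each number with a placeholder (hypothetical entity tag)."""
    return NUMBER_RE.sub("<NUM>", text)


def preprocess(text: str) -> str:
    """Normalize first, so all digit variants reach the tagger as ASCII."""
    return tag_numbers(normalize(text))


if __name__ == "__main__":
    # The same word typed with Arabic KAF and with Persian KEHEH
    # collapses to a single vocabulary entry after normalization.
    arabic_form = "\u0643\u062A\u0627\u0628"
    persian_form = "\u06A9\u062A\u0627\u0628"
    print(normalize(arabic_form) == persian_form)
    print(preprocess("\u06F1\u06F2.\u06F5"))
```

In this sketch, spelling variants of one word and all numeric strings each map to a single token, which is one plausible way a pre-processor reduces vocabulary size; spell correction and tokenization/detokenization are not covered here.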