15th International Conference on Computer and Knowledge Engineering
ParsHomo: A T5-Powered Approach to High-Precision Persian Homograph Disambiguation
Authors :
Hasan Jalali (1), Taha Mohaddesi (2)
1- School of Electrical and Computer Engineering, College of Engineering, University of Tehran
2- ASMIN AI Research Center
Keywords :
Homograph Disambiguation, Persian Language Processing, Grapheme-to-Phoneme Conversion, T5 Transformer, Natural Language Processing, Text-to-Speech
Abstract :
This paper presents ParsHomo, a high-precision model for Persian homograph disambiguation built on fine-tuned T5-based architectures (mT5 and ByT5). Starting from the HomoRich dataset, we manually curated and refined the human-labeled portion to ensure high data quality. The cleaned dataset was then used to fine-tune the T5 models, leading to substantial improvements over previous systems. We evaluated several input formatting strategies (question-style, label-based, and separator-based) and used the trained model to relabel large-scale datasets such as HomoRich-GPT4o and GE2PE, resulting in a bootstrapped corpus of over 4.19 million sentences. ParsHomo offers the most accurate and scalable homograph disambiguation system for Persian to date, demonstrating the effectiveness of transformer-based models for context-aware phoneme generation in low-resource languages. Our best-performing model (mT5 with the label-based format) achieved an accuracy of 94.55% and a Character Error Rate of 1.71%, outperforming prior approaches by a wide margin.
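The abstract reports a Character Error Rate (CER) but does not say how it was computed. The sketch below assumes the common definition: total Levenshtein edit distance between predicted and reference phoneme strings, divided by the total number of reference characters. The function names and the romanized example strings are illustrative and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER (assumed definition): total edits / total reference chars."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0


# Illustrative romanized pronunciations of one homograph; not data from the paper.
references = ["mard", "mord"]
predictions = ["mard", "mard"]
cer = character_error_rate(references, predictions)
```

Here one substitution over eight reference characters gives a CER of 0.125. Sentence-level accuracy would instead count a prediction as correct only when it matches the reference exactly.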