11th International Conference on Computer and Knowledge Engineering
Semi-automatic Detection of Persian Stopwords using FastText Library
Authors :
Mohammad Dehghani (1)
Mohammad Manthouri (2)
1- Tarbiat Modares University
2- Shahed University
Keywords :
stopword, natural language processing, fastText, word embedding, text mining
Abstract :
A stopword is a word that, despite its very high frequency, adds little semantic information to a text. Stopwords include prepositions, conjunctions, and pronouns. One of the steps in natural language processing is to remove stopwords in order to reduce dataset size and speed up processing. In this study, a semi-automatic method for collecting Persian stopwords is proposed. The proposed method lists the stopwords of each text depending on its subject. For this purpose, based on a corpus of news texts, the Inverse Document Frequency (IDF) weight of each word is calculated and the stopword candidates are determined. Then, using the fastText library, the vector of each word is obtained. In the next step, the five nearest neighbors of each vector are found. Finally, after duplicate words are removed, the final list of stopwords (1014 stopwords) is collected. Simulation results show that the accuracy of detecting stopwords with the k-nearest neighbor method is 94.6%.
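The pipeline described in the abstract (IDF-based candidate selection, then expansion with five nearest neighbors and de-duplication) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `idf_threshold` parameter, the toy English corpus, and the hand-written two-dimensional vectors are all assumptions made for illustration. In practice the vectors would presumably come from a trained fastText model, and the resulting list would still need manual review, which is what makes the method semi-automatic.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return IDF = log(N / df) for each word in a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, vectors, k):
    """Return the k words whose vectors are most similar to the vector of `word`."""
    scored = [(cosine(vectors[word], vec), w) for w, vec in vectors.items() if w != word]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


def expand_stopwords(documents, vectors, idf_threshold, k=5):
    """Pick low-IDF words as candidates, add their k neighbors, drop duplicates.

    `idf_threshold` is a hypothetical cut-off; the abstract does not say how
    candidates are selected from the IDF weights.
    """
    idf = idf_scores(documents)
    candidates = [w for w, score in idf.items() if score <= idf_threshold and w in vectors]
    stopwords = set(candidates)  # a set removes duplicate words automatically
    for word in candidates:
        stopwords.update(nearest_neighbors(word, vectors, k))
    return sorted(stopwords)


# Toy data (assumed): "the" occurs in every document, so its IDF is 0.
docs = [
    ["the", "cat", "of", "rome"],
    ["the", "dog", "of", "paris"],
    ["the", "bird", "in", "rome"],
]
# Hand-written stand-ins for word embeddings (assumed, not real fastText output).
toy_vectors = {
    "the": [1.0, 0.0],
    "of": [0.9, 0.1],
    "in": [0.8, 0.2],
    "rome": [0.2, 0.8],
    "dog": [0.1, 0.9],
    "cat": [0.0, 1.0],
}

print(expand_stopwords(docs, toy_vectors, idf_threshold=0.1, k=2))
# -> ['in', 'of', 'the']
```

In this toy run only "the" passes the IDF cut-off, and its two nearest neighbors, "of" and "in", are added even though their own IDF is too high to qualify. That is the purpose of the neighbor step: it recovers function words that the frequency filter alone would miss.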