11th International Conference on Computer and Knowledge Engineering
Semi-automatic Detection of Persian Stopwords using FastText Library
Authors :
Mohammad Dehghani (1), Mohammad Manthouri (2)
1- Tarbiat Modares University
2- Shahed University
Keywords :
stopword, natural language processing, fastText, word embedding, text mining
Abstract :
A stopword is a word that adds little semantic information to a text despite its very high frequency. Stopwords include prepositions, conjunctions, and pronouns. One of the steps in natural language processing is to remove stopwords in order to reduce dataset size and speed up processing. In this study, a semi-automatic method for collecting Persian stopwords is proposed. The proposed method lists the stopwords of each text depending on its subject. For this purpose, based on a corpus of news texts, the Inverse Document Frequency (IDF) weight of each word is calculated and the stopword candidates are determined. Then, using the fastText library, the vector of each word is obtained. In the next step, five neighbors are found for each vector. Next, by removing duplicate words, the final list of stopwords (1014 stopwords) is collected. The simulation results show that the accuracy of detecting stopwords by the k-nearest neighbor method is 94.6%.
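The pipeline described in the abstract (IDF weighting, candidate selection, five nearest neighbors per candidate, duplicate removal) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_idf` threshold, the use of cosine similarity, and the toy vectors are all assumptions made for illustration. In practice the word vectors would come from a trained fastText model rather than a hand-written dictionary, and the manual review that makes the method semi-automatic is not modeled here.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return {word: IDF} for a list of tokenized documents, with IDF = log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def stopword_candidates(docs, max_idf):
    """Words whose IDF is at most max_idf (an assumed, hypothetical cut-off)."""
    scores = idf_scores(docs)
    return sorted(word for word, score in scores.items() if score <= max_idf)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(word, vectors, k=5):
    """The k words most similar to `word` (cosine similarity is an assumption)."""
    target = vectors[word]
    others = (w for w in vectors if w != word)
    ranked = sorted(others, key=lambda w: cosine(target, vectors[w]), reverse=True)
    return ranked[:k]


def expand_candidates(candidates, vectors, k=5):
    """Add each candidate's k neighbors; the set removes duplicate words."""
    expanded = set(candidates)
    for word in candidates:
        if word in vectors:
            expanded.update(nearest_neighbors(word, vectors, k))
    return sorted(expanded)


# Toy English data standing in for a Persian news corpus and fastText vectors.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
vectors = {"the": [1.0, 0.0], "a": [0.9, 0.1], "cat": [0.0, 1.0], "dog": [0.1, 0.9]}

candidates = stopword_candidates(docs, max_idf=0.0)
print(expand_candidates(candidates, vectors, k=1))
```

With k=5, as in the abstract, each candidate contributes up to five neighbors before duplicates are removed; the toy example uses k=1 only because its vocabulary is tiny.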