11th International Conference on Computer and Knowledge Engineering
Semi-automatic Detection of Persian Stopwords using FastText Library
Authors :
Mohammad Dehghani (1), Mohammad Manthouri (2)
1- Tarbiat Modares University
2- Shahed University
Keywords :
stopword, natural language processing, fastText, word embedding, text mining
Abstract :
A stopword is a word that adds little semantic information to a text despite its very high frequency. Stopwords include prepositions, conjunctions, and pronouns. Removing stopwords is a common step in natural language processing, since it reduces dataset size and speeds up processing. In this study, a semi-automatic method for collecting Persian stopwords is proposed. The proposed method lists the stopwords of each text depending on its subject. For this purpose, the Inverse Document Frequency (IDF) weight of each word is calculated over a corpus of news texts, and the stopword candidates are determined. Then, the vector of each word is obtained using the fastText library. In the next step, five neighbors are found for each vector. Finally, after duplicate words are removed, the final list of 1014 stopwords is collected. Simulation results show that the accuracy of detecting stopwords with the k-nearest neighbor method is 94.6%.
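The pipeline described in the abstract (IDF weighting, candidate selection, nearest-neighbor expansion, deduplication) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the IDF threshold, and the toy two-dimensional vectors are assumptions made for illustration; the abstract does not state how candidates are cut off from the IDF ranking. In practice the vectors would come from a trained fastText model, whose Python bindings expose `get_nearest_neighbors` for the neighbor lookup; here a plain cosine-similarity search over a dictionary stands in for it so that the example runs without a model file.

```python
import math
from collections import Counter


def idf_scores(docs):
    """IDF = log(N / df), where df is the number of documents containing the word."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each word once per document
    return {w: math.log(n / c) for w, c in df.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def nearest_neighbors(word, vectors, k=5):
    """Return the k words whose vectors are most similar to the given word's vector."""
    target = vectors[word]
    scored = [(cosine(target, vec), w) for w, vec in vectors.items() if w != word]
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


def expand_stopwords(docs, vectors, idf_threshold, k=5):
    """Low-IDF words become candidates; their neighbors are added; a set removes duplicates.

    idf_threshold is a hypothetical cut-off, not a value taken from the paper.
    """
    idf = idf_scores(docs)
    candidates = [w for w, s in idf.items() if s <= idf_threshold and w in vectors]
    result = set(candidates)
    for w in candidates:
        result.update(nearest_neighbors(w, vectors, k))
    return sorted(result)


# Toy English data standing in for the Persian news corpus and fastText vectors.
docs = ["the cat sat", "the dog ran", "the bird and the fish", "and the end"]
vectors = {"the": [1.0, 0.0], "and": [0.9, 0.1], "cat": [0.0, 1.0], "dog": [0.1, 1.0]}
stopwords = expand_stopwords(docs, vectors, idf_threshold=0.1, k=1)
```

In this toy run "the" occurs in every document, so its IDF is zero and it is the only candidate; its single nearest neighbor "and" is then added. Because the method is semi-automatic, a list produced this way would still need manual review before use.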