11th International Conference on Computer and Knowledge Engineering
Semi-automatic Detection of Persian Stopwords using FastText Library
Authors :
Mohammad Dehghani (1), Mohammad Manthouri (2)
1- Tarbiat Modares University
2- Shahed University
Keywords :
stopword, natural language processing, fastText, word embedding, text mining
Abstract :
A stopword is a word that adds little semantic information to a text despite its very high frequency. Stopwords include prepositions, conjunctions, and pronouns. Removing stopwords is a common step in natural language processing that reduces dataset size and speeds up processing. In this study, a semi-automatic method for collecting Persian stopwords is proposed. The proposed method lists the stopwords of each text depending on its subject. For this purpose, the Inverse Document Frequency (IDF) weight of each word is calculated over a corpus of news texts, and the stopword candidates are determined. Then, using the fastText library, the vector of each word is obtained. In the next step, the five nearest neighbors of each vector are found. Finally, after duplicate words are removed, the final list of 1014 stopwords is collected. Simulation results show that the accuracy of detecting stopwords with the k-nearest neighbor method is 94.6%.
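The pipeline outlined in the abstract (IDF weighting, candidate selection, nearest-neighbor expansion, duplicate removal) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes IDF is computed as log(N / df), that candidates are the words whose IDF falls at or below a hypothetical `idf_threshold`, and that neighbors are ranked by cosine similarity. The tiny hand-written `vectors` dictionary and the English toy tokens are stand-ins for fastText embeddings of Persian words, and all function names are illustrative. Any manual review implied by "semi-automatic" is not modeled.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return IDF = log(N / df) per word; docs is a list of token lists."""
    n = len(docs)
    # Document frequency: count each word once per document.
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / count) for word, count in df.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, vectors, k=5):
    """Return the k words most similar to `word` (ties broken alphabetically)."""
    sims = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    sims.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in sims[:k]]


def expand_stopwords(docs, vectors, idf_threshold, k=5):
    """Pick low-IDF candidates, add their k neighbors, and drop duplicates."""
    idf = idf_scores(docs)
    # Assumption: a low IDF (word occurs in most documents) marks a candidate.
    candidates = sorted(
        w for w, score in idf.items() if score <= idf_threshold and w in vectors
    )
    expanded = set(candidates)  # a set removes duplicate words automatically
    for word in candidates:
        expanded.update(nearest_neighbors(word, vectors, k))
    return sorted(expanded)


# Toy corpus and toy 2-d vectors (placeholders for a news corpus and fastText).
docs = [
    ["the", "cat", "of", "rome"],
    ["the", "dog", "of", "paris"],
    ["the", "bird", "and", "fish"],
]
vectors = {
    "the": (1.0, 0.0),
    "of": (0.9, 0.1),
    "and": (0.95, 0.05),
    "cat": (0.0, 1.0),
    "dog": (0.1, 0.9),
}

# "the" and "of" are the low-IDF candidates; "and" is pulled in as a neighbor.
print(expand_stopwords(docs, vectors, idf_threshold=0.5, k=1))
# → ['and', 'of', 'the']
```

In this toy run, "and" appears in only one document, so IDF alone would not select it. It enters the list because its vector lies close to those of the candidates, which is the role the neighbor search plays in the described method.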