0% Complete
Home
/
11th International Conference on Computer and Knowledge Engineering
Semi-automatic Detection of Persian Stopwords using FastText Library
Authors :
Mohammad Dehghani
1
Mohammad Manthouri
2
1- Tarbiat Modares University
2- Shahed University
Keywords :
stopword, natural language processing, fastText, word embedding, text mining
Abstract :
A stopword is a word that does not add much semantic information to the text that despite of its very high frequency. Stopwords include prepositions, conjunctions, and pronouns. One of the steps in natural language processing is to remove stopwords to reduce dataset size and process faster. In this study, a semi-automatic method for collecting the Persian language's stopwords is proposed. The proposed method lists the stopwords of each text depending on its subject. For this purpose, based on a corpus of news texts, the Inverse Document Frequency (IDF) weight in the text is calculated for each word and the stopwords candidates are determined. Then, using the fastText library, the vector of each word is obtained. In the next step, five neighbors are found for each vector. Next, by removing duplicate words, the final list of stopwords (1014 stopwords) is collected. The result of simulations show the accuracy of detecting stopwords by the k-nearest neighbor method is 94.6%.
Papers List
List of archived papers
Assessing Users' Influence on Respondents in Conversation Quality: A Quantitative Study on Reddit Based on the Cooperative Principle
Afsaneh Habibi - Fattaneh Taghiyareh
Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics
Masoumeh Chapariniya - Teodora Vukovic - Sarah Ebling - Volker Dellwo
A Vision-Based Method for Human Activity Recognition Using Local Binary Pattern
Babak Goodarzi - Reza Javidan - Mohammad Sadegh Rezaei
Classification of Audio Streaming in Network Traffic Based on Machine Learning Methods
Mohammad Nikbakht - Mehdi Teimouri
An effective hybrid algorithm for locating splicing forgery image
Seyed Hesamoddin Hosseini - Amene Vatanparast - Amir Hossein Taherinia
A Comparative Analysis of Clinical Note Categories for Mortality Prediction in ICU Patients
Maryam Karrabi - Mohsen Kahani - Mina Afzali - Nadieh Armin
GroupRec: Group Recommendation by Numerical Characteristics of Groups in Telegram
Davod Karimpour - Mohammad Ali Zare Chahooki - Ali Hashemi
Real-Time Forecasting Using Mixed Frequency Time-Series Data
Armin Khayati - Mohammad Taheri - Koorush Ziarati
I-ACS: An Improved Ant Colony System to Solve the Time-Dependent Orienteering Problem
Zahra Bakhshandeh - Morteza Keshtkaran
Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition
Hamid Ahmadabadi - Omid Nejati Manzari - Ahmad Ayatollahi
more
Samin Hamayesh - Version 43.7.0