12th International Conference on Computer and Knowledge Engineering
Degarbayan-SC: A Colloquial Paraphrase Farsi Subtitles Dataset
Authors :
Mohammad Javad Aghajani (1), Mohammad Ali Keyvanrad (2)
1- Maleke-ashtar University of Technology
2- Maleke-ashtar University of Technology
Keywords :
colloquial dataset, deep learning, movie subtitles, Natural Language Processing, paraphrase detection, paraphrase generation, Persian dataset, supervised learning
Abstract :
Paraphrase generation and paraphrase detection are important tasks in Natural Language Processing (NLP), with applications such as information retrieval, text simplification, question answering, and chatbots. The lack of comprehensive Persian paraphrase datasets is a major obstacle to progress in this area. Despite the importance of such datasets, no large-scale corpus has been made available so far, given the difficulty and intensive labor involved in creating one. This paper describes the construction of Degarbayan-SC from movie subtitles, together with some of the difficulties we encountered during data extraction and sentence alignment. Movie subtitles are written in colloquial language, which differs from formal language. To the best of our knowledge, Degarbayan-SC is the first freely released large-scale (on the order of a million words) Persian paraphrase corpus, and we expect it to support further work on Persian paraphrasing. We tested the dataset with neural network models and compared the performance of different attention-based models (transformers) and a GRU model on it. We also report the sentences generated by the neural networks and evaluate them with human judgments.