12th International Conference on Computer and Knowledge Engineering
Degarbayan-SC: A Colloquial Paraphrase Farsi Subtitles Dataset
Authors :
Mohammad Javad Aghajani (1), Mohammad Ali Keyvanrad (2)
1- Maleke-ashtar University of Technology
2- Maleke-ashtar University of Technology
Keywords :
colloquial dataset, deep learning, movie subtitles, Natural Language Processing, paraphrase detection, paraphrase generation, Persian dataset, supervised learning
Abstract :
Paraphrase generation and paraphrase detection are important tasks in Natural Language Processing (NLP), with applications such as information retrieval, text simplification, question answering, and chatbots. The lack of comprehensive Persian paraphrase datasets is a major obstacle to progress in this area. Despite the importance of such datasets, no large-scale corpus has been made available so far, given the difficulty and intensive labor involved in creating one. This paper describes the construction of Degarbayan-SC from movie subtitles, along with some of the difficulties we encountered during data extraction and sentence alignment. Movie subtitles are written in colloquial language, which differs from formal language. To the best of our knowledge, Degarbayan-SC is the first freely released large-scale (on the order of a million words) Persian paraphrase corpus, and we expect it to support further work on Persian paraphrasing. We tested the dataset with neural network models, comparing the performance of different attention-based models (transformers) and a GRU model on it. We also report the sentences generated by the neural networks and evaluate them with human evaluation metrics.