11th International Conference on Computer and Knowledge Engineering
A Language-Independent Approach to Classification of Textual File Fragments: Case Study of Persian, English, and Chinese Languages
Authors :
Fatemeh Mansouri Hanis (1), Hamidreza Khoshvaghti (2), Mehdi Teimouri (3), Hadi Veisi (4)
1- University of Tehran
2- University of Tehran
3- University of Tehran
4- University of Tehran
Keywords :
Classification, File fragments, language-dependent file type identification, textual files, context language, file format.
Abstract :
With the growth of communication systems in recent decades, the transmission of electronic files over computer networks has increased dramatically. As a result, identifying file types is important in many applications, such as digital forensics and file carving. State-of-the-art methods for identifying the file type of a file fragment are based on the content of the fragment. To the best of the authors' knowledge, no study has addressed the effect of context language on identifying the file type of textual file fragments. In this paper, we consider a machine learning approach for classification among five common text file formats: PDF, DOC, DOCX, RTF, and TXT. We also examine the effect of context language on the classification of file fragments. Two scenarios are considered. In the first, the training and testing sets are in the same language, and this yields the best results: the test accuracies for Persian, English, and Chinese are 85.6%, 76.4%, and 86.1%, respectively. In the second scenario, the training and testing languages differ: training uses one language, and evaluation is performed on the other two. In this case, the average accuracies for Persian, English, and Chinese are 60.0%, 58.5%, and 71.4%, respectively. The evaluations of the second scenario show that the language-independent machine learning approach is robust in identifying the DOC, DOCX, and RTF formats.
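The two evaluation scenarios can be illustrated with a minimal sketch. The abstract does not specify the features or the classifier, so the byte-frequency histogram and nearest-centroid classifier below are stand-in assumptions, not the authors' method. The function names and the `data` layout are hypothetical. The sketch also assumes that the cross-language figure for a language is the mean accuracy of a model trained on that language and tested on each of the others.

```python
from collections import Counter


def byte_histogram(fragment: bytes) -> list[float]:
    """Normalized frequency of each byte value (assumed feature, not from the paper)."""
    counts = Counter(fragment)
    n = len(fragment) or 1
    return [counts.get(b, 0) / n for b in range(256)]


def train_centroids(samples):
    """samples: list of (fragment_bytes, file_type). Returns file_type -> mean histogram."""
    sums, counts = {}, Counter()
    for fragment, label in samples:
        acc = sums.setdefault(label, [0.0] * 256)
        for i, v in enumerate(byte_histogram(fragment)):
            acc[i] += v
        counts[label] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, fragment: bytes) -> str:
    """Assign the file type whose centroid is closest in squared Euclidean distance."""
    h = byte_histogram(fragment)
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(h, centroids[lab])),
    )


def accuracy(centroids, samples) -> float:
    hits = sum(predict(centroids, frag) == label for frag, label in samples)
    return hits / len(samples)


def evaluate(data):
    """data: language -> {"train": [...], "test": [...]} (hypothetical layout).

    Returns (same, cross):
      same[lang]  = accuracy when training and testing on lang (scenario 1)
      cross[lang] = mean accuracy when training on lang and testing on
                    every other language (scenario 2, as assumed here)
    """
    same, cross = {}, {}
    for lang in data:
        model = train_centroids(data[lang]["train"])
        same[lang] = accuracy(model, data[lang]["test"])
        others = [accuracy(model, data[o]["test"]) for o in data if o != lang]
        cross[lang] = sum(others) / len(others)
    return same, cross
```

With three languages in `data`, `evaluate` returns one same-language accuracy and one averaged cross-language accuracy per language, which is the layout in which the abstract reports its results.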