Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

Hassina Hadjadj, Halim Sayoud

Source Title: International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 15(4)

ISSN: 1557-3958 | EISSN: 1557-3966 | EISBN13: 9781799859857 | DOI: 10.4018/IJCINI.20211001.oa33

MLA

Hadjadj, Hassina, and Halim Sayoud. "Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents." IJCINI vol.15, no.4 2021: pp.1-17. http://doi.org/10.4018/IJCINI.20211001.oa33

APA

Hadjadj, H. & Sayoud, H. (2021). Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 15(4), 1-17. http://doi.org/10.4018/IJCINI.20211001.oa33

Chicago

Hadjadj, Hassina, and Halim Sayoud. "Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents," International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 15, no.4: 1-17. http://doi.org/10.4018/IJCINI.20211001.oa33

Abstract

Dealing with imbalanced data is a major challenge in data mining and machine learning tasks. In this investigation, we address the problem of class imbalance in the Authorship Attribution (AA) task, with a specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. Our experimental results show that the proposed approach, used with the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character bigrams. In addition, the proposed method improves AA performance on imbalanced datasets, mainly with function words.
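
For readers unfamiliar with SMOTE, the core idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name smote, the parameters n_new, k and seed, and the toy feature vectors are all assumptions made for illustration, and the PCA step and the SMO-SVM classifier mentioned in the abstract are not reproduced. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative sketch only; names and defaults are hypothetical.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for an under-represented author
# (assumed values, not taken from the article's dataset).
minority = [[0.10, 0.30], [0.12, 0.28], [0.15, 0.35], [0.11, 0.33]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already occupied by that class, which is what distinguishes SMOTE from simple duplication of minority samples.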