Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents

Hassina Hadjadj, Halim Sayoud
Copyright: © 2021 | Volume: 15 | Issue: 4 | Pages: 17
ISSN: 1557-3958 | EISSN: 1557-3966 | EISBN13: 9781799859857 | DOI: 10.4018/IJCINI.20211001.oa33

MLA

Hadjadj, Hassina, and Halim Sayoud. "Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents." IJCINI vol.15, no.4 2021: pp.1-17. http://doi.org/10.4018/IJCINI.20211001.oa33

APA

Hadjadj, H. & Sayoud, H. (2021). Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 15(4), 1-17. http://doi.org/10.4018/IJCINI.20211001.oa33

Chicago

Hadjadj, Hassina, and Halim Sayoud. "Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents," International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 15, no.4: 1-17. http://doi.org/10.4018/IJCINI.20211001.oa33

Abstract

Dealing with imbalanced data is a major challenge in data mining and machine learning tasks. In this investigation, we address the problem of class imbalance in the Authorship Attribution (AA) task, with a specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. The results of our experiments show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character bigrams. In addition, the proposed method improves AA performance on imbalanced datasets, mainly with function words.
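
To make the over-sampling idea concrete, the following is a minimal sketch of the basic SMOTE interpolation step, written with the Python standard library only. It is not the authors' implementation: the function name `smote`, its parameters, and the toy feature values are assumptions chosen for illustration, and the PCA and SMO-SVM stages of the proposed approach are not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority
    samples and their nearest minority neighbours (basic SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors (hypothetical values, e.g. relative frequencies
# of two stylometric features per text segment).
majority = [[0.10, 0.30], [0.12, 0.28], [0.11, 0.33],
            [0.09, 0.31], [0.13, 0.29], [0.10, 0.27]]
minority = [[0.40, 0.05], [0.42, 0.07], [0.38, 0.06]]

# Generate just enough synthetic samples to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Each synthetic vector lies on the line segment between two real minority samples, so the minority class grows without duplicating existing texts verbatim. In practice, a maintained library implementation would normally be preferred over a hand-written one.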