A novel term weighting scheme for text classification: TF-MONO

DOĞAN, TURGUT; Uysal, Alper

doi:10.1016/j.joi.2020.101076

DOĞAN T., Uysal A. K.

Journal of Informetrics, vol. 14, no. 4, 2020 (SCI-Expanded, SSCI, Scopus)

Publication Type: Article / Full Article
Volume: 14 Issue: 4
Publication Date: 2020
DOI: 10.1016/j.joi.2020.101076
Journal Name: Journal of Informetrics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science
Keywords: Max-occurrence, Non-occurrence, Supervised term weighting, Text classification
Affiliated with Trakya University: Yes

Abstract

Effectively representing the relationship between documents and their contents is crucial to the performance of text classification. Term weighting is a preprocessing step that aims to represent text documents better in the vector space by assigning appropriate weights to terms. Because the calculation of these weights directly affects classification performance, term weighting remains an important research sub-area of text classification. In this study, we propose a novel term weighting strategy, MONO, which uses the non-occurrence information of terms more effectively than existing term weighting approaches in the literature. The proposed strategy also performs intra-class document scaling to better represent the distinguishing capabilities of terms that occur in different numbers of documents but in the same number of classes. Based on the MONO strategy, we propose two novel supervised term weighting schemes for text classification, TF-MONO and SRTF-MONO. The proposed schemes were tested with two classifiers, SVM and KNN, on three datasets: Reuters-21578, 20-Newsgroups, and WebKB. Their classification performance was compared with that of five existing term weighting schemes from the literature: TF-IDF, TF-IDF-ICF, TF-RF, TF-IDF-ICSDF, and TF-IGM. The results obtained from the seven schemes show that SRTF-MONO generally outperformed the other schemes on all three datasets. Moreover, TF-MONO showed promising Micro-F1 and Macro-F1 results compared with the five benchmark term weighting methods, especially on the Reuters-21578 and 20-Newsgroups datasets.
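
As a point of reference for what a term weighting scheme does, the TF-IDF baseline named in the abstract can be sketched as follows. This is not the MONO strategy, whose formula is not given in this record. The function name `tf_idf`, the natural-logarithm IDF variant without smoothing, and the toy documents are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term frequency times log(N / df), no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


# Hypothetical toy corpus, already tokenized.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "export"],
    ["oil", "price", "fall"],
]
weights = tf_idf(docs)
```

In this sketch a term that appears in fewer documents, such as "export", receives a larger weight than one shared across documents, such as "wheat". TF-IDF ignores class labels, whereas supervised schemes such as those described in the abstract also use class information.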