A Classification Framework of Identifying Major Documents With Search Engine Suggestions and Unsupervised Subtopic Clustering

Chen Zhao, Takehito Utsuro, Yasuhide Kawada

Source Title: International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 15(4)

ISSN: 1557-3958|EISSN: 1557-3966|EISBN13: 9781799859857|DOI: 10.4018/IJCINI.20211001.oa42

MLA

Zhao, Chen, et al. "A Classification Framework of Identifying Major Documents With Search Engine Suggestions and Unsupervised Subtopic Clustering." IJCINI vol.15, no.4 2021: pp.1-15. http://doi.org/10.4018/IJCINI.20211001.oa42

APA

Zhao, C., Utsuro, T., & Kawada, Y. (2021). A Classification Framework of Identifying Major Documents With Search Engine Suggestions and Unsupervised Subtopic Clustering. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 15(4), 1-15. http://doi.org/10.4018/IJCINI.20211001.oa42

Chicago

Zhao, Chen, Takehito Utsuro, and Yasuhide Kawada. "A Classification Framework of Identifying Major Documents With Search Engine Suggestions and Unsupervised Subtopic Clustering," International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 15, no.4: 1-15. http://doi.org/10.4018/IJCINI.20211001.oa42

Abstract

This paper addresses the problem of automatically recognizing out-of-topic documents in a small set of similar documents that are expected to share a common topic. The objective is to remove noise documents from the set. A topic-model-based classification framework is proposed for discovering out-of-topic documents. The paper introduces the concept of annotated search engine suggests, in which the search queries used to find a page are taken as representations of that page's content. Word embeddings are used to create distributed representations of words and documents and to compare the similarity of search engine suggests. The results show that search engine suggests can be highly accurate semantic representations of textual content, and that the document analysis algorithm using this representation as a relevance measure gives satisfactory in-topic content filtering performance compared to the baseline technique of topic probability ranking.
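The similarity comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' framework, which is topic-model based and is not specified in the abstract. The sketch only assumes that a document and a search engine suggest are each represented by the average of their word vectors, and that a document is treated as out-of-topic when its cosine similarity to the suggest falls below a threshold. The toy vectors, the function names, and the threshold of 0.5 are all hypothetical.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real system would load trained word embeddings instead.
TOY_VECTORS = {
    "camera": [0.9, 0.1, 0.0],
    "lens": [0.8, 0.2, 0.0],
    "photo": [0.7, 0.3, 0.1],
    "recipe": [0.0, 0.1, 0.9],
    "pasta": [0.1, 0.0, 0.8],
}


def embed(tokens, vectors=TOY_VECTORS):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def out_of_topic(docs, suggest_tokens, threshold=0.5):
    """Return ids of documents whose similarity to the suggest is below threshold.

    The threshold is a hypothetical parameter, not a value from the paper.
    """
    target = embed(suggest_tokens)
    flagged = []
    for doc_id, tokens in docs.items():
        vec = embed(tokens)
        # Documents (or suggests) with no known words get a score of 0.0.
        score = cosine(vec, target) if vec is not None and target is not None else 0.0
        if score < threshold:
            flagged.append(doc_id)
    return flagged


docs = {
    "d1": ["camera", "lens"],
    "d2": ["photo", "camera"],
    "d3": ["recipe", "pasta"],
}
flagged = out_of_topic(docs, ["camera", "photo"])
print(flagged)
```

In this toy set, the document about recipes is the only one flagged, since its averaged vector points away from the suggest's vector. Averaging word vectors is one simple way to obtain a document representation; the paper may use a different one.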