Combining Distributed Word Representation and Document Distance for Short Text Document Clustering


Supavit Kongwudhikunakorn, Kitsana Waiyamai, Journal of Information Processing Systems Vol. 16, No. 2, pp. 277-300, Apr. 2020  

https://doi.org/10.3745/JIPS.04.0164
Keywords: Document Clustering, Document Distance, Short Text Documents, Short Text Document Clustering

Abstract

"This paper presents a method for clustering short text documents, such as news headlines, social media statuses, or instant messages. Due to the characteristics of these documents, which are usually short and sparse, an appropriate technique is required to discover hidden knowledge. The objective of this paper is to identify the combination of document representation, document distance, and document clustering that yields the best clustering quality. Document representations are expanded by external knowledge sources represented by a Distributed Representation. To cluster documents, a K-means partitioning-based clustering technique is applied, where the similarities of documents are measured by word mover’s distance. To validate the effectiveness of the proposed method, experiments were conducted to compare the clustering quality against several leading methods. The proposed method produced clusters of documents that resulted in higher precision, recall, F1-score, and adjusted Rand index for both real-world and standard data sets. Furthermore, manual inspection of the clustering results was conducted to observe the efficacy of the proposed method. The topics of each document cluster are undoubtedly reflected by members in the cluster."
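
The pipeline described in the abstract (word embeddings, a word-mover-style document distance, and partitioning-based clustering) can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: the `EMB` table holds hypothetical 2-D toy vectors in place of pretrained embeddings; `rwmd` computes the relaxed word mover's distance, a cheap lower bound, rather than the exact word mover's distance; and `k_medoids` is a medoid-based stand-in for K-means, chosen because a cluster mean is not directly defined under this distance. All function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical 2-D toy embeddings (assumption); a real setup would load
# pretrained word vectors instead.
EMB = {
    "stocks": (1.0, 0.0), "market": (0.9, 0.1),
    "shares": (0.95, 0.05), "rally": (0.8, 0.2),
    "football": (0.0, 1.0), "match": (0.1, 0.9),
    "goal": (0.05, 0.95), "team": (0.2, 0.8),
}


def nbow(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens.

    Assumes each document has at least one token present in EMB.
    """
    counts = Counter(t for t in tokens if t in EMB)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def one_sided(a, b):
    """Move every word in `a` entirely to its nearest word in `b`."""
    return sum(
        weight * min(math.dist(EMB[w], EMB[v]) for v in b)
        for w, weight in a.items()
    )


def rwmd(doc_a, doc_b):
    """Relaxed word mover's distance: a lower bound on the exact distance."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided(a, b), one_sided(b, a))


def k_medoids(docs, k, iters=10):
    """Partition docs into k clusters around medoids (stand-in for K-means)."""
    n = len(docs)
    dist = [[rwmd(docs[i], docs[j]) for j in range(n)] for i in range(n)]

    # Deterministic farthest-first initialization, starting from document 0.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(
            max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        )

    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each document joins its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)
        ]
        # Update step: the new medoid minimizes total distance to its members.
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            updated.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if updated == medoids:
            break
        medoids = updated
    return labels


docs = [
    ["stocks", "rally"],
    ["market", "shares"],
    ["football", "match"],
    ["team", "goal"],
]
labels = k_medoids(docs, k=2)
print(labels)
```

On this toy input the two finance headlines and the two sports headlines fall into separate clusters even though no word is shared between documents in the same cluster, which is the property that makes an embedding-based distance attractive for short, sparse texts.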


Cite this article
[APA Style]
Kongwudhikunakorn, S. & Waiyamai, K. (2020). Combining Distributed Word Representation and Document Distance for Short Text Document Clustering. Journal of Information Processing Systems, 16(2), 277-300. DOI: 10.3745/JIPS.04.0164.

[IEEE Style]
S. Kongwudhikunakorn and K. Waiyamai, "Combining Distributed Word Representation and Document Distance for Short Text Document Clustering," Journal of Information Processing Systems, vol. 16, no. 2, pp. 277-300, 2020. DOI: 10.3745/JIPS.04.0164.

[ACM Style]
Supavit Kongwudhikunakorn and Kitsana Waiyamai. 2020. Combining Distributed Word Representation and Document Distance for Short Text Document Clustering. Journal of Information Processing Systems, 16, 2, (2020), 277-300. DOI: 10.3745/JIPS.04.0164.