Similarity Measure Algorithm for Text Document Clustering, Using Singular Value Decomposition
Valentina Adu *
ICT Directorate, Kumasi Technical University, P.O. Box 854, Kumasi, Ghana.
Michael Donkor Adane
Department of Information Technology, Akatsi College of Education, P. O. Box PMB, Akatsi, Ghana.
Kwadwo Asante
Department of Information Technology Education, Akenten Appiah-Menka University of Skills Training and Entrepreneurial Development, Kumasi Technical University, P.O. Box 1277, Kumasi, Ghana.
*Author to whom correspondence should be addressed.
Abstract
We examined a similarity measure for text document clustering. Data mining is a challenging field with many research and application areas. Text document clustering, a subset of data mining, groups and organizes a large quantity of unstructured text documents into a small number of meaningful clusters. We used an algorithm that works by calculating the degree of closeness between documents from their document matrix to query the terms (words) in each document. We also determined whether the documents in a given set are similar to or different from one another when these terms are queried. We found that the ability to rank and approximate documents using a matrix allows Singular Value Decomposition (SVD) to be used as an enhanced text data mining algorithm. In addition, applying SVD to a high-dimensional matrix yields a lower-dimensional matrix that exposes the relationships in the original matrix by ordering its components from the most variation to the least.
Keywords: Data mining, similarity, term frequency, singular value decomposition, clustering
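The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The whitespace tokenizer, the raw term-frequency weighting, the choice of k = 2, the use of cosine similarity as the closeness measure, the toy documents, and all function names are assumptions made for illustration only. The sketch builds a term-document matrix, keeps the k largest singular values, and compares documents in the reduced space.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a raw term-frequency matrix: rows are terms, columns are documents."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for tok in toks:
            A[index[tok], j] += 1.0
    return A, vocab

def reduced_document_vectors(A, k):
    """Represent each document by its coordinates on the k largest singular directions."""
    # numpy returns the singular values s in descending order
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Columns of diag(s_k) @ Vt_k are the documents expressed in the basis U_k
    return (np.diag(s[:k]) @ Vt[:k, :]).T  # shape: (n_docs, k)

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T

# Hypothetical toy corpus: two documents about mining, two about football
docs = [
    "data mining finds patterns in data",
    "text mining finds patterns in text documents",
    "the football team won the match",
    "the team lost the football match",
]
A, vocab = term_document_matrix(docs)
doc_vecs = reduced_document_vectors(A, k=2)
sim = cosine_similarity_matrix(doc_vecs)
print(np.round(sim, 3))
```

In this toy corpus the two topic groups share no terms, so documents within a group come out close to each other in the reduced space and documents from different groups come out nearly orthogonal. A real corpus would normally need stop-word removal and a weighting scheme such as TF-IDF before the decomposition.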