I have a lot of documents that I have clustered using a clustering algorithm in which each document may belong to more than one cluster. I've created one table storing the document-cluster assignments and another storing the cluster-document info. When I look for the list of documents similar to a given document (let's say d_i), I first retrieve the list of clusters to which it belongs (from the document-cluster table), and then for each cluster c_j in that list I retrieve the list of documents which belong to c_j from the cluster-document table. There is more than one c_j, so there will be multiple lists. Each list has many documents, and the lists may overlap.
In the next phase, in order to find the documents most similar to d_i, I rank the candidate documents by the number of clusters they have in common with d_i.
My question is about this last phase. A naive solution is to build a sorted HashMap-like structure with the document as the key and the number of common clusters as the value. However, as each list might contain a very large number of documents, this may not be the best solution. Is there any other way to rank the similar items? Any preprocessing or ..?
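
For concreteness, here is a minimal sketch of the naive approach I have in mind. It assumes the two tables fit in memory as plain dicts; the names `doc_clusters`, `cluster_docs`, and `rank_similar`, as well as the sample data, are made up for illustration and are not my real schema.

```python
from collections import Counter

# Hypothetical stand-in for the document-cluster table:
# document id -> list of clusters it belongs to.
doc_clusters = {
    "d1": ["c1", "c2", "c3"],
    "d2": ["c1", "c2"],
    "d3": ["c2"],
    "d4": ["c3", "c4"],
}

# Hypothetical stand-in for the cluster-document table,
# derived here by inverting the mapping above.
cluster_docs = {}
for doc, clusters in doc_clusters.items():
    for cluster in clusters:
        cluster_docs.setdefault(cluster, []).append(doc)


def rank_similar(doc_id, top_k=None):
    """Rank other documents by how many clusters they share with doc_id."""
    counts = Counter()
    for cluster in doc_clusters[doc_id]:
        # Every document in this cluster shares one more cluster with doc_id.
        counts.update(cluster_docs[cluster])
    del counts[doc_id]  # the query document is not its own neighbour
    # most_common sorts by count, highest first; top_k=None returns all.
    return counts.most_common(top_k)


print(rank_similar("d1"))
```

The cost of this is proportional to the total length of all the retrieved lists, plus the sort at the end, which is the part I am worried about when the lists are long.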