
I have lots of documents that I have clustered using a clustering algorithm. In this clustering algorithm, each document may belong to more than one cluster. I've created a table storing the document-cluster assignments and another one which stores the cluster-document info. When I look for the list of documents similar to a given document (let's say d_i), I first retrieve the list of clusters to which it belongs (from the document-cluster table), and then for each cluster c_j in that list I retrieve the list of documents which belong to c_j from the cluster-document table. There is more than one c_j, so the result is multiple lists. Each list has many documents, and there may be overlaps among these lists.

In the next phase, in order to find the documents most similar to d_i, I rank the candidate documents by the number of clusters they have in common with d_i.

My question is about this last phase. A naive solution is to create a sorted HashMap which has the document as the key and the number of common clusters as the value. However, as each list might contain very many documents, this may not be the best solution. Is there any other way to rank the similar items? Any preprocessing, or ...?
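For concreteness, here is a minimal sketch of that naive approach in Java (the two tables are passed in as plain maps; names like `docToClusters` and `clusterToDocs` are just for illustration):

    import java.util.*;

    class NaiveRanker {
        // Rank candidate documents by the number of clusters they share with docId.
        static List<Map.Entry<String, Integer>> rank(
                String docId,
                Map<String, List<String>> docToClusters,   // document-cluster table
                Map<String, List<String>> clusterToDocs) { // cluster-document table
            Map<String, Integer> common = new HashMap<>();
            for (String c : docToClusters.getOrDefault(docId, Collections.emptyList())) {
                for (String doc : clusterToDocs.getOrDefault(c, Collections.emptyList())) {
                    if (!doc.equals(docId)) {
                        common.merge(doc, 1, Integer::sum); // one more shared cluster
                    }
                }
            }
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(common.entrySet());
            ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
            return ranked;
        }
    }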

HHH
  • Define similar. Are you looking for something semantically similar, or for a near duplicate? If the latter, dupe of: http://stackoverflow.com/q/23053688/572670 – amit May 04 '15 at 21:09
  • Personally, it would help me a lot if you could add *details* to your question. (Keep this question, and add another question to the same thread.) Spell out in more detail what you've done. "I'm from Missouri ... *show* me." Then, it will be much easier for your question to get the good answers that it needs. – Mike Robinson May 04 '15 at 21:38
  • Let me summarize my question: I have multiple arrays of large size. The elements of the arrays have overlap (e.g. an element may exist in more than one array). I would like to create a sorted list of the elements of all the arrays and the sorting criterion is the number of appearances of each element. The easiest way is to create a sorted hashmap but this may not be very efficient considering the size of the arrays. Any better approach? – HHH May 04 '15 at 22:10

1 Answer


Assuming that the number of arrays is relatively small compared to the number of elements (in particular, that the number of arrays is in o(log n)), you can do it with a modification of bucket sort:

Let m be the number of arrays. Create an array of m buckets, buckets[1..m], where each buckets[i] is a hash set:

for each array arr:
   for each element x in arr:
      if x is already in some bucket (let its id be i):
         remove x from buckets[i]
         i <- i + 1
      else:
         i <- 1
      add x to buckets[i]

for each bucket i = m, m-1, ..., 1 in descending order:
   for each element x in buckets[i]:
      yield x

The above runs in O(m^2 * n), where n is the average number of elements per array:

  • Iterating over each array: a factor of m
  • Iterating over all the elements in each array: a factor of n
  • Finding the relevant bucket for each element: a factor of m

Note that the last step can be done in O(1) expected time by maintaining an auxiliary hash map from element to bucket_id, so the total improves to O(m*n).
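A minimal Java sketch of this bucket approach, including the element -> bucket_id map suggested above (generic types and method names are my own choices; like the pseudocode, it assumes an element appears at most once per array):

    import java.util.*;

    class BucketRanker {
        // Returns the elements of all arrays, ordered by the number of
        // arrays they appear in, descending. m = lists.size().
        static <T> List<T> rank(List<List<T>> lists) {
            int m = lists.size();
            List<Set<T>> buckets = new ArrayList<>();
            for (int i = 0; i < m; i++) buckets.add(new HashSet<>());
            Map<T, Integer> bucketOf = new HashMap<>(); // element -> bucket id, O(1) lookup

            for (List<T> arr : lists) {
                for (T x : arr) {
                    Integer i = bucketOf.get(x);
                    if (i == null) {
                        i = 0;                    // first appearance: bucket 0
                    } else {
                        buckets.get(i).remove(x); // promote to the next bucket
                        i = i + 1;
                    }
                    buckets.get(i).add(x);
                    bucketOf.put(x, i);
                }
            }

            List<T> result = new ArrayList<>();
            for (int i = m - 1; i >= 0; i--) {    // most shared arrays first
                result.addAll(buckets.get(i));
            }
            return result;
        }
    }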


An alternative is to use a hash map as a histogram that maps each element to its number of occurrences, and then sort the combined list of elements based on the histogram. The benefit of this approach: it can be distributed very nicely with map-reduce:

map(partial list of elements l):
    for each element x in l:
       emit(x, 1)

combine(x, list<number> partialCounts):  // optional local pre-aggregation
   emit(x, sum(partialCounts))

reduce(x, list<number> counts):
   emit(x, sum(counts))  // without a combiner, all values are 1, so size(counts) works too
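For comparison, here is a minimal single-machine version of the same histogram idea in Java (just a hash map and a sort, no map-reduce; names are illustrative):

    import java.util.*;
    import java.util.stream.*;

    class HistogramRanker {
        // Count how many times each element occurs across all lists,
        // then order the distinct elements by that count, descending.
        static <T> List<T> rank(List<List<T>> lists) {
            Map<T, Long> counts = lists.stream()
                    .flatMap(List::stream)
                    .collect(Collectors.groupingBy(x -> x, Collectors.counting()));
            return counts.entrySet().stream()
                    .sorted(Map.Entry.<T, Long>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }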
amit
  • Is this much better than the naive method of an `O(n)` pass to create the histogram (a hash map), and then an `O(n log n)` sort? If m is in `o(log n)`, then there's no *asymptotic* difference between `O(m*n)` and `O(n log n)`. The real world running time, of course, could be much different. I don't have a good feel for your algorithm's complexity to make that judgement. – Jim Mischel May 05 '15 at 07:11
  • @JimMischel (1) There is: it is small-o, not big-O, notation. (2) Note that in this case `n` is the average number of elements per list, not the total number of elements, so if #uniques is high, the naive complexity is `O(nm log(nm)) ~= O(nm log(n))` assuming `m < n`, so you still save a factor of `log n`. Intuitively, note that the bucket sort is linear in the size of the data, assuming `m` … – amit May 05 '15 at 07:17