
I have a set of about 7 Million phrases to be matched to about 300 Million queries.

Queries can be substrings of the phrases or contain the phrases themselves. Basically, I want a measure of 'similarity' between two phrases (not necessarily the edit distance).

Can someone give some pointers to efficient algorithms for doing this? I would prefer distributed algorithms, since I am going to do this on Hadoop via streaming, using Python.
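For concreteness, one possible non-edit-distance similarity is token overlap (Jaccard). This is only an illustration of the kind of measure meant above; the tokenization (lowercasing, whitespace split) is my assumption:

```python
# Token-overlap (Jaccard) similarity between a phrase and a query.
# One possible "similarity" that is not edit distance; tokenization
# by lowercased whitespace split is an assumption for this sketch.

def jaccard(phrase, query):
    """Return |intersection| / |union| of the two token sets (0.0 .. 1.0)."""
    a = set(phrase.lower().split())
    b = set(query.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard("apache hadoop streaming", "hadoop streaming with python"))
```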

Rohan Monga
  • I don't understand what you are trying to do. What are the requirements for a "match" between a phrase and a query? Is it simply that the phrase has to be a substring of the query? What kind of output do you want to get from this algorithm? – David Grayson Feb 21 '11 at 04:35
  • I have updated the question with more detail. – Rohan Monga Feb 21 '11 at 05:17

2 Answers


Bed-trees look interesting:

Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance (Pdf of presentation)

Martin DeMello

This is not trivial, because you have a very large amount of data on one side and even more on the other.

The simplest approach would be to build a Lucene index on the 7 million phrases and let the Hadoop job query the index. I'm not quite sure whether you need a Solr server for that, or whether a similar implementation exists in Python.
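As a rough pure-Python illustration of the indexing idea (a toy in-memory inverted index standing in for Lucene; all the names and the overlap scoring here are mine, not a real Lucene/Solr API):

```python
from collections import defaultdict

# Toy in-memory inverted index over the phrase set: token -> phrase ids.
# In practice Lucene/Solr would handle the indexing, scoring and disk
# layout; this only sketches how a query finds candidate phrases.

def build_index(phrases):
    index = defaultdict(set)
    for pid, phrase in enumerate(phrases):
        for token in phrase.lower().split():
            index[token].add(pid)
    return index

def lookup(index, query, min_overlap=1):
    """Return (phrase_id, shared_token_count) pairs for candidate matches."""
    hits = defaultdict(int)
    for token in set(query.lower().split()):
        for pid in index.get(token, ()):
            hits[pid] += 1
    return [(pid, n) for pid, n in hits.items() if n >= min_overlap]

phrases = ["new york pizza", "cheap flights", "pizza delivery"]
idx = build_index(phrases)
print(sorted(lookup(idx, "best pizza in new york")))
```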

The mapper should write out the phrase ID or line number, or whatever you have to identify it (or at least the phrase itself), along with the matching score.

In the reduce step you could group on a phrase key and write out all the related queries with their scores (or whatever output you want).
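A minimal sketch of what such a streaming mapper/reducer pair might emit, assuming tab-separated `phrase_id \t query \t score` records and a placeholder token-overlap score (field layout and names are mine):

```python
from itertools import groupby

# Sketch of a Hadoop-streaming style mapper/reducer pair. The mapper emits
# "phrase_id \t query \t score" lines; the reducer groups by phrase id and
# writes all matching queries for that phrase. The overlap score is only a
# placeholder for whatever similarity you actually use.

def map_line(query, phrase_index):
    """Yield 'phrase_id\tquery\tscore' for every candidate phrase match."""
    qtokens = set(query.lower().split())
    for pid, ptokens in phrase_index.items():
        score = len(qtokens & ptokens)
        if score:
            yield "%s\t%s\t%d" % (pid, query, score)

def reduce_lines(lines):
    """Group mapper output by phrase id; yield 'phrase_id\tquery:score,...'."""
    parsed = (line.split("\t") for line in sorted(lines))
    for pid, group in groupby(parsed, key=lambda fields: fields[0]):
        matches = ",".join("%s:%s" % (q, s) for _, q, s in group)
        yield "%s\t%s" % (pid, matches)

phrase_index = {"p1": {"hadoop", "streaming"}, "p2": {"python"}}
mapped = list(map_line("hadoop streaming with python", phrase_index))
for out in reduce_lines(mapped):
    print(out)
```

In a real streaming job, the mapper and reducer would each read from stdin and print to stdout, with Hadoop doing the sort-and-group between them.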
For similarity you can read further here:

  • Similarity of Apache Lucene
  • Apache Lucene itself

Thomas Jungblut
  • 20,854
  • 6
  • 68
  • 91