I have a set of about 7 Million phrases to be matched to about 300 Million queries.
Queries can be sub-strings or contain the phrases themselves. Basically I want a measure of 'similarity' between two phrases [ not necessarily the edit distance ]
Can someone give some pointers to efficient algorithms to do this. I would prefer distributed algorithms since I am going to do this on Hadoop via streaming using python.