
So my data set currently looks like the following:

['microsoft','bizspark'],
['microsoft'],
['microsoft', 'skype'],
['amazon', 's3'],
['amazon', 'zappos'],
['amazon'],

.... etc.

Now what I would love to do is cluster these with regard to one another, using the Levenshtein distance to calculate word scores.

I would then iterate through all of the lists and compare each one's distance to the others, hoping to end up with groupings like the following:

microsoft -> ['microsoft','bizspark'], ['microsoft'], ['microsoft', 'skype'],
amazon -> ['amazon', 's3'], ['amazon', 'zappos'], ['amazon'], ....

The question is how to do this. Should I calculate each Levenshtein distance on a word-by-word basis, i.e. for ['amazon', 'zappos'] and ['microsoft','bizspark'] I would first get the pairs (amazon, microsoft), (amazon, bizspark), (zappos, microsoft), (zappos, bizspark) and calculate the distance of each pair?

Or should I really just create strings from these and then calculate the distance?
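
For illustration, here is a minimal sketch of both options, using a plain-Python Levenshtein implementation so it does not assume any particular distance library; averaging the pairwise word distances is just one possible way to aggregate them.

    def levenshtein(a, b):
        # classic dynamic-programming edit distance between two strings
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def word_by_word_distance(list_a, list_b):
        # option 1: compare every word pair and average the distances
        pairs = [levenshtein(wa, wb) for wa in list_a for wb in list_b]
        return sum(pairs) / len(pairs)

    def joined_distance(list_a, list_b):
        # option 2: join each list into a single string and compare once
        return levenshtein(' '.join(list_a), ' '.join(list_b))

    print(word_by_word_distance(['amazon', 'zappos'], ['microsoft', 'bizspark']))
    print(joined_distance(['amazon', 'zappos'], ['microsoft', 'bizspark']))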

What I should then end up with is an N x N matrix of distances:

                              ['microsoft','bizspark'] | ['amazon', 'zappos'] | ...
    ['microsoft','bizspark']              0            |          ?           |
    ['amazon', 'zappos']                  ?            |          0           |
    ...
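
A rough sketch of filling such a matrix with nested loops, reusing joined_distance from the sketch above (word_by_word_distance would work just as well) and wrapping the result in a pandas DataFrame:

    import numpy as np
    import pandas as pd

    records = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
               ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

    n = len(records)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = joined_distance(records[i], records[j])
            dist[i, j] = dist[j, i] = d   # symmetric; the diagonal stays 0

    labels = [' '.join(r) for r in records]
    dist_df = pd.DataFrame(dist, index=labels, columns=labels)
    print(dist_df)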

Then how do I apply clustering to this to determine a cut-off threshold?
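
One possible approach (an assumption on my part, not the only option) is SciPy's hierarchical clustering: build a linkage from the condensed form of the distance matrix above and cut the tree at a distance threshold; the threshold value below is arbitrary and would need tuning.

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    condensed = squareform(dist)               # condensed form expected by linkage
    Z = linkage(condensed, method='average')   # average-linkage is one choice of many
    cluster_ids = fcluster(Z, t=5, criterion='distance')  # cut-off at 5 edits
    for label, cid in zip(labels, cluster_ids):
        print(cid, label)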

One such suggestion using single words is discussed here

But I'm not sure how to go about it with regard to word lists.

Please note, regarding the implementation, I am using Python libraries such as NumPy, SciPy and Pandas as needed.

redrubia
    This might be a better fit for http://stats.stackexchange.com/ until you work out what it is you want to be implementing. – Andy Hayden Dec 18 '13 at 05:16
  • Check out the jellyfish library also. It provides other distance measures such as jaro-distance which I personally find more useful. – Woody Pride Dec 19 '13 at 06:51

2 Answers


What you match against probably depends primarily on what your goals are. If you want to match either word, you probably should match against both words separately. If you want to match against phrases, then ' '.join()'ing them is probably a good idea.

BTW, I recently did some fuzzy matching using difflib.get_close_matches(). It's in the Python Standard Library. I don't have anything against whatever Levenshtein distance library you may use; I just wanted to point out this option as one that worked for me.
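
A short example of difflib.get_close_matches() from the standard library; the candidate list and misspelling below are made up for illustration:

    import difflib

    candidates = ['microsoft', 'bizspark', 'skype', 'amazon', 'zappos', 's3']

    # returns up to n candidates whose similarity ratio is at least `cutoff`
    print(difflib.get_close_matches('microsfot', candidates, n=3, cutoff=0.6))
    # ['microsoft']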

dstromberg
  • I guess one of the main reasons is my own understanding of how the Levenshtein distance works, and I'd have to really get to grips with and understand the validity and inner workings of the one you suggested - but I will take a look. I think it would be interesting to combine a weighting and try to match on both words as well as the joined string -> I wonder if this would provide more information or make it lose clarity – redrubia Dec 18 '13 at 04:22
  • However, if I do match each word, I'm not too sure how. Do I create pairs and instantly cluster when two words are the same, such as the microsoft one above? – redrubia Dec 18 '13 at 04:27

Maybe "frequent itemset mining" is more what you looking for than clustering.

It will find frequent word combinations, such that each document may be part of multiple patterns.
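
As a hedged sketch, one way to try this in Python is the third-party mlxtend library (my choice, not named in the answer), which one-hot encodes the transactions and runs Apriori:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    records = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
               ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

    # one-hot encode the word lists into a boolean DataFrame
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit_transform(records), columns=te.columns_)

    # keep itemsets that appear in at least a third of the records
    print(apriori(onehot, min_support=1/3, use_colnames=True))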

Has QUIT--Anony-Mousse
  • I've just looked into this. I'm not sure it will help, mainly as words may be incorrectly spelt and then not matched - there is no correction aspect. Also, each of the pairs shown above actually has a key, which is basically a string such as 'Microsoft Bizspark London' that needs to be maintained. On another note, pairs may never be repeated, and I assume that by doing frequent itemset mining I will lose things that may occur only once... – redrubia Dec 18 '13 at 14:41
  • Try to be more clear in your question *what* you want to do then. I *would* consider rare word pairs to be noise, and discard them. – Has QUIT--Anony-Mousse Dec 18 '13 at 15:19