So my data set currently looks like the following:
['microsoft','bizspark'],
['microsoft'],
['microsoft', 'skype'],
['amazon', 's3'],
['amazon', 'zappos'],
['amazon'],
.... etc.
What I would love to do is cluster these lists with respect to one another, using the Levenshtein distance to score word similarity.
I would iterate through all of the lists and compare the distances, hoping to end up with groupings roughly like the following:
microsoft -> ['microsoft','bizspark'], ['microsoft'], ['microsoft', 'skype'],
amazon -> ['amazon', 's3'], ['amazon', 'zappos'], ['amazon'], ....
The question is how to do this. Should I calculate the Levenshtein distance on a word-by-word basis, i.e. for ['amazon', 'zappos'] and ['microsoft','bizspark'] first form the pairs (amazon, microsoft), (amazon, bizspark), (zappos, microsoft), (zappos, bizspark) and calculate the distance for each pair?
Or should I really just join each list into a single string and calculate the distance between the resulting strings?
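For concreteness, here is a rough sketch of the two options I'm considering, with a plain-Python Levenshtein implementation so it isn't tied to a particular package (averaging the pairwise scores in option 1 is just one arbitrary way to aggregate them):

    def levenshtein(a, b):
        # classic dynamic-programming edit distance between two strings
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        return prev[-1]

    def pairwise_list_distance(words1, words2):
        # Option 1: compare every word in one list against every word in the
        # other and aggregate (here: the mean of all pairwise distances)
        pairs = [(w1, w2) for w1 in words1 for w2 in words2]
        return sum(levenshtein(w1, w2) for w1, w2 in pairs) / len(pairs)

    def joined_string_distance(words1, words2):
        # Option 2: join each list into a single string and compare the strings
        return levenshtein(' '.join(words1), ' '.join(words2))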
Either way, I should then end up with an N×N matrix of distances:
                              ['microsoft','bizspark']   ['amazon','zappos']   ...
    ['microsoft','bizspark']             0                        ?
    ['amazon','zappos']                  ?                        0
    ...
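To make the matrix idea concrete, this is roughly how I imagine filling it, using whichever of the two distance functions sketched above turns out to be appropriate (the sample data here is just the snippet from the top of the question):

    import numpy as np

    data = [['microsoft', 'bizspark'], ['microsoft'], ['microsoft', 'skype'],
            ['amazon', 's3'], ['amazon', 'zappos'], ['amazon']]

    n = len(data)
    dist = np.zeros((n, n))  # distance of a list to itself stays 0 on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            d = pairwise_list_distance(data[i], data[j])  # or joined_string_distance
            dist[i, j] = dist[j, i] = d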
Then how do I apply clustering to this to determine a cut-off threshold?
One such suggestion using single words is discussed here
But I'm not sure how to go about it with regard to word lists!?
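What I have vaguely in mind is something along the lines of SciPy's hierarchical clustering cut at a distance criterion, but the value of the threshold is exactly the part I don't know how to choose:

    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    # condense the symmetric NxN matrix (zero diagonal) into the form SciPy expects
    condensed = squareform(dist)
    Z = linkage(condensed, method='average')

    # cut the dendrogram at some distance threshold -- this is the number
    # I don't know how to pick in a principled way
    threshold = 5.0
    labels = fcluster(Z, t=threshold, criterion='distance')
    print(labels)  # one cluster id per word list in `data`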
Please note that for the implementation I am using Python libraries such as Numpy, Scipy and Pandas as needed.