Similarity Analysis in SQL or Python or R

Question

I have a very interesting problem I have been trying to resolve in past few days without luck. I have the 120k descriptions of the items that I have to compare to 38k of items and determine what is the level of similarity between. Ultimately I want to see if any of 38k exist within 120k based on similarity. I found nice similarity script in excel and I organized my data as multiplication table so I can compare each description from 120k to each description in 38k. See pic below. So the function works, however, the amount of calculation is just not possible to run in excel. We are talking over 2 billion calculation if I split this in half ( 120k X 16k). The function is comparing description from A2 to B1, then A2 to C1 and so forth till the end which is 16k. Then it goes description from A3 and does the same and 120k times like that.

Does anyone know Script in SQL or R or Python that can do this if put this on the powerful server?

You could use Jaro Winkler, which is a good match for your data, as it appears to be product names? Here's a link to an implementaiton for Python: https://pypi.python.org/pypi/pyjarowinkler, but this is also in R, and there are SQL versions as well. I wrote a version in C# to match literally millions of product names, and got it all done in around 4-5 hours. I had to put my laptop in a fridge though, as it was starting to overheat by the end! — Richard Hansell, Mar 14 '18 at 11:36
@RichardHansell thank you for the suggestion. I am trying to figure out how to run this in R right now. :-) — Slavisha, Mar 14 '18 at 14:37
You might find this is pretty similar to your requirement (two lists to work with - when usually there is just one list): https://stackoverflow.com/questions/29102155/r-string-fuzzy-matching-using-jarowinkler — Richard Hansell, Mar 14 '18 at 20:11

score 1 · Answer 1 · answered Mar 14 '18 at 12:59

You are looking for aproximate string matching. There is a free add-on for Excel, developed by Microsoft to create a so called Fuzzy match. It uses the Jaccard index algorithm to determine the similarity of two given values.

Make sure that both lists of descriptions are listed in a sortable table column (Ctrl+L);
Link the columns in the 'Left Columns' and the 'Right Columns' section by clikcing on them and press the connect button in the middle;
Select which columns you want as output (hold Ctrl if you want to select multiple columns on either the left or the right side);
Make sure the FuzzyLookup.Similarity is checked, this will give the similarity score between the values 0-1;
Determine the maximum number of matches shown per comparable string;
Determine your Threshold. The number represents the minimum percentage of similarity between two strings before it marks it as a match;
Go to a new sheet to cell A1, that's because the new generated similarity table will overwrite current data;
Hit the 'Go'button!
Select all the similarity scores and give them more decimals for a proper result.

See example.

thank you for the response. I tried this solution however it is very slow even on my server. I estimate that it might take few days to finish. — Slavisha, Mar 14 '18 at 14:38

Similarity Analysis in SQL or Python or R

1 Answers1