I have an interesting problem that I have been trying to solve for the past few days without luck. I have 120k item descriptions that I need to compare against 38k item descriptions to determine how similar each pair is. Ultimately, I want to find out whether any of the 38k items exist within the 120k, based on similarity. I found a nice similarity function for Excel and organized my data like a multiplication table so that I can compare each description from the 120k list to each description from the 38k list. See pic below.

The function works, but the number of calculations is simply too large to run in Excel. We are talking about roughly 2 billion calculations (120k x 16k = 1.92 billion), and that is after splitting the 38k list roughly in half, down to 16k. The function compares the description in A2 to B1, then A2 to C1, and so forth until the end of the 16k columns. Then it takes the description in A3 and does the same, and so on for all 120k rows.
Does anyone know of a script in SQL, R, or Python that can do this if I run it on a powerful server?
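
To make the comparison concrete, here is a minimal Python sketch of what the spreadsheet is doing, under some assumptions: `difflib.SequenceMatcher` from the standard library stands in for the Excel similarity function (which may score differently), and the function name `best_matches`, the 0.8 threshold, and the sample descriptions are all made up for illustration. It only shows the shape of the calculation; a plain nested loop like this would still perform every pairwise comparison and is probably too slow for the full 120k x 38k data.

```python
from difflib import SequenceMatcher


def best_matches(queries, catalog, threshold=0.8):
    """For each query description, find the most similar catalog description.

    Returns (query, best_catalog_item, score) tuples for queries whose best
    score is at least `threshold`. The threshold value is an assumption.
    """
    results = []
    for q in queries:
        best_score, best_item = 0.0, None
        for c in catalog:
            # ratio() is a similarity score between 0.0 and 1.0
            score = SequenceMatcher(None, q.lower(), c.lower()).ratio()
            if score > best_score:
                best_score, best_item = score, c
        if best_item is not None and best_score >= threshold:
            results.append((q, best_item, best_score))
    return results


# Hypothetical sample data: the short list plays the role of the 38k items,
# the long list plays the role of the 120k items.
short_list = ["steel bolt m8 x 40", "rubber gasket 2in"]
long_list = ["Steel bolt M8x40", "copper pipe 15mm", "Rubber gasket 2 in"]

for query, match, score in best_matches(short_list, long_list):
    print(f"{query!r} ~ {match!r} (score {score:.2f})")
```

Keeping only the best match per description, instead of the full table of scores, avoids having to store billions of results.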