
I have been working on matching a source set of customer names against a master set. This can be achieved with `adist` in R, but I am now working with a source set of 2 million names against a master set of 500k, and `adist` cannot be used here because it does not support long vectors. I have therefore chunked the data into smaller sets, currently about 70k source names against 20k master names, but the set sizes vary between chunks and I could not get `adist` to work because it does not support sets of different sizes. I have tried various other ways to achieve the same thing with `amatch`, `pmatch` and `agrep`, without much help, and the sites I referred to did not lead me to a solution.
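A minimal sketch of the chunked `adist` approach I have been trying looks like the following; `source_names` and `master_names` are small stand-ins for the real vectors, and `chunk_size` is only an illustrative value:

```r
# Chunked matching with adist(): process the master set in slices and
# keep only the best match seen so far for each source name.
source_names <- c("Jon Smith", "Acme Corp", "Globex Inc")
master_names <- c("John Smith", "ACME Corporation", "Globex Incorporated", "Initech")

chunk_size <- 1000
best_match <- rep(NA_character_, length(source_names))
best_dist  <- rep(Inf, length(source_names))

for (start in seq(1, length(master_names), by = chunk_size)) {
  end  <- min(start + chunk_size - 1, length(master_names))
  d    <- adist(source_names, master_names[start:end])   # rows = sources, cols = this chunk
  idx  <- max.col(-d, ties.method = "first")             # column of the smallest distance per row
  dmin <- d[cbind(seq_along(source_names), idx)]
  better <- dmin < best_dist                              # discard everything but the best match so far
  best_dist[better]  <- dmin[better]
  best_match[better] <- master_names[start:end][idx[better]]
}

result <- data.frame(source = source_names, match = best_match, distance = best_dist)
```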

I have also tried `levenshteinDist`, `levenshteinSim` and `jarowinkler`, but I have problems applying them to a huge data frame. Is there a solution for my data frame, similar to this solution using `jarowinkler`, that works with sets of different sizes?
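One alternative I am considering is only a sketch, assuming the `stringdist` package: its `amatch()` can return the index of the nearest master name per source name under the Jaro-Winkler method without materialising the full distance matrix. The `maxDist = 0.15` threshold and the names below are illustrative values, not my real data:

```r
library(stringdist)

source_names <- c("Jon Smith", "Acme Corp")
master_names <- c("John Smith", "ACME Corporation", "Initech")

# Index of the closest master name for each source name,
# or NA when nothing is within maxDist.
idx <- amatch(tolower(source_names), tolower(master_names),
              method = "jw", p = 0.1, maxDist = 0.15)

data.frame(source = source_names,
           match  = master_names[idx])  # NA rows mean no match within the threshold
```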

  • Couple of comments for you: (1) `adist()` definitely *does* support unequal vector sizes, e.g. `adist(letters[1:2],letters[1:3])`; it returns a matrix for the result. Unless I misunderstood your point? And (2) I don't think lack of support for long vectors is the only constraint here; running `adist()` on two vectors of length 2e6 and 5e5 respectively would result in a *crazily* huge matrix, completely impossible for most computer systems to handle. For example, I just tried it on my system, and got the following error message: "Error: cannot allocate vector of size 7450.6 Gb". – bgoldst May 12 '15 at 08:10
  • Can you explain how you're planning on using the resulting distance numbers? For example, are you looking for the first match for each customer name that has a distance number below a certain upper threshold? If that's your logic, you should be able to get chunking to work with `adist()`, provided you keep the chunks reasonably small and immediately discard all but the best match for each customer name. – bgoldst May 12 '15 at 08:13
  • @bgoldst That is similar to the error I got, which is why I had to chunk the data into small sets. It works now, after a long time, but it is not efficient and does not produce the expected result. – KRU May 12 '15 at 10:00

0 Answers