
I have been working on matching a source set of customer names against a master set. This can be achieved with `adist` in R, but I am now working with a source set of 2 million names against a master set of 500k, and `adist` cannot be used here because it does not support long vectors. I have therefore chunked the data into smaller sets, currently about 70k source names against 20k master names, but the set sizes vary between chunks and I could not get `adist` to work because it does not support sets of different sizes. I have tried various other ways to achieve the same thing with `amatch`, `pmatch` and `agrep`, without much help, and the sites I referred to did not lead me to a solution.
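A minimal sketch of the chunked `adist` approach I have been trying looks like the following; `source_names` and `master_names` are small stand-ins for the real vectors, and `chunk_size` is only an illustrative value:

```r
# Chunked matching with adist(): process the master set in slices and
# keep only the best match seen so far for each source name.
source_names <- c("Jon Smith", "Acme Corp", "Globex Inc")
master_names <- c("John Smith", "ACME Corporation", "Globex Incorporated", "Initech")

chunk_size <- 1000
best_match <- rep(NA_character_, length(source_names))
best_dist  <- rep(Inf, length(source_names))

for (start in seq(1, length(master_names), by = chunk_size)) {
  end  <- min(start + chunk_size - 1, length(master_names))
  d    <- adist(source_names, master_names[start:end])   # rows = sources, cols = this chunk
  idx  <- max.col(-d, ties.method = "first")             # column of the smallest distance per row
  dmin <- d[cbind(seq_along(source_names), idx)]
  better <- dmin < best_dist                              # discard everything but the best match so far
  best_dist[better]  <- dmin[better]
  best_match[better] <- master_names[start:end][idx[better]]
}

result <- data.frame(source = source_names, match = best_match, distance = best_dist)
```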

I have also tried `levenshteinDist`, `levenshteinSim` and `jarowinkler`, but I have problems applying them to a huge data frame. Is there a solution for my data frame, similar to this solution using `jarowinkler`, that works with sets of different sizes?
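One alternative I am considering is only a sketch, assuming the `stringdist` package: its `amatch()` can return the index of the nearest master name per source name under the Jaro-Winkler method without materialising the full distance matrix. The `maxDist = 0.15` threshold and the names below are illustrative values, not my real data:

```r
library(stringdist)

source_names <- c("Jon Smith", "Acme Corp")
master_names <- c("John Smith", "ACME Corporation", "Initech")

# Index of the closest master name for each source name,
# or NA when nothing is within maxDist.
idx <- amatch(tolower(source_names), tolower(master_names),
              method = "jw", p = 0.1, maxDist = 0.15)

data.frame(source = source_names,
           match  = master_names[idx])  # NA rows mean no match within the threshold
```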

  • Couple of comments for you: (1) `adist()` definitely *does* support unequal vector sizes, e.g. `adist(letters[1:2],letters[1:3])`; it returns a matrix for the result. Unless I misunderstood your point? And (2) I don't think lack of support for long vectors is the only constraint here; running `adist()` on two vectors of length 2e6 and 5e5 respectively would result in a *crazily* huge matrix, completely impossible for most computer systems to handle. For example, I just tried it on my system, and got the following error message: "Error: cannot allocate vector of size 7450.6 Gb". – bgoldst May 12 '15 at 08:10
  • Can you explain how you're planning on using the resulting distance numbers? For example, are you looking for the first match for each customer name that has a distance number below a certain upper threshold? If that's your logic, you should be able to get chunking to work with `adist()`, provided you keep the chunks reasonably small and immediately discard all but the best match for each customer name. – bgoldst May 12 '15 at 08:13
  • @bgoldst That is similar to the error I got, which is why I had to chunk the data into small sets. It works now, after a long time, but it is not efficient and does not produce the expected result. – KRU May 12 '15 at 10:00

0 Answers