
I am trying to subset data to create a list of possible duplicates in a new data frame. The problem is that the names are in different formats, and possibly only a small part of the ID may actually match.

I need R to output a list of possible duplicates for me to then check.

I've found a few examples that deal with formatting issues, or with cases where it's the first few characters that you are trying to match. I am not sure how to put the code together, and the characters that match may be anywhere in the name.

So far, this seems to get me the closest, but I'm still not sure how to adapt the code to work for me:

Subset a df using partial match with multiple criteria

This is what my data looks like (but with millions of lines):

Supplier.Name              Date.of.Record  BMCC.avg
SG & JM Hammond            2018-07-21      292.2381
Mileshan Nominees Pty Ltd  2018-12-21      130.0000
RW & GJ Brown & Sons       2018-02-21      162.8333
BD & BA Smith              2018-02-21      478.0000

In the end, I would like a list of possible duplicates based on partial matches (maybe 4 or 5 characters in a row?).

Right now I can't seem to put together any code at all. Even a few suggestions for a starting point would be helpful. Thanks!

Cae.rich
  • Related post? https://stackoverflow.com/q/2231993/680068 – zx8754 Jul 16 '19 at 15:43
  • Can you provide the sample data in an easily-consumed format? Many prefer either the output from `dput(x)`, where `x` is either `head(mydat,n=10)` with the right number of rows, or a hand-picked sampling of rows in order to give us enough variability and duplicates to effect a match (and please don't include more columns than necessary to get the point across). Similarly, please provide the expected output of that sample data. Thanks! – r2evans Jul 16 '19 at 15:44
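
As a possible starting point, here is a minimal sketch in base R of the "4 or 5 characters in a row" idea from the question. It assumes that punctuation, spacing, and case can be ignored when comparing names. The fifth sample row is an invented near-duplicate added only for illustration, and the helper names `normalise`, `kgrams`, and `possible_duplicates` are hypothetical, not from any package.

```r
# Sample data modelled on the rows in the question. The last row is a
# made-up near-duplicate of the first, included only to show a match.
dat <- data.frame(
  Supplier.Name = c("SG & JM Hammond",
                    "Mileshan Nominees Pty Ltd",
                    "RW & GJ Brown & Sons",
                    "BD & BA Smith",
                    "Hammond, S.G. and J.M."),
  stringsAsFactors = FALSE
)

# Lower-case a name and drop everything except letters and digits
# (assumption: formatting differences are not meaningful).
normalise <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# All distinct substrings of length k in a single string.
kgrams <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))
  unique(substring(s, 1:(n - k + 1), k:n))
}

# Pairs of rows whose normalised names share at least one run of
# k consecutive characters, anywhere in the name.
possible_duplicates <- function(names, k = 5) {
  grams <- lapply(normalise(names), kgrams, k = k)
  # Long table with one row per (row id, k-character run).
  idx <- data.frame(id = rep(seq_along(grams), lengths(grams)),
                    gram = as.character(unlist(grams)),
                    stringsAsFactors = FALSE)
  # Self-join on the shared run, then keep each unordered pair once.
  pairs <- merge(idx, idx, by = "gram")
  pairs <- unique(pairs[pairs$id.x < pairs$id.y, c("id.x", "id.y")])
  data.frame(name1 = names[pairs$id.x],
             name2 = names[pairs$id.y],
             stringsAsFactors = FALSE)
}

res <- possible_duplicates(dat$Supplier.Name, k = 5)
print(res)
```

On the sample above, the only pair reported is the first and fifth rows, which share runs such as "hammo" after normalisation. With millions of rows the self-join could become very large when a run is common (for example one coming from "Pty Ltd"), so very frequent runs would probably need to be filtered out first. Base R's `agrep()` and `adist()` are an alternative if matching by edit distance suits the data better than shared runs.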

0 Answers