Match values in two different data frames, that are not identical, with matching pattern in R

Question

I am a beginner at R and I try to figure out what limits this remarkable (sometimes nerv-wracking) program have.

Here is my problem: I have two data frames(df's) from two different files of raw data. In these two data frames I have columns with ID numbers for individuals. I know how to merge these to df's together by ID. There problem is that the person who registered the ID numbers in one of the data frames have typed some of the ID numbers incorrect. For example, the ID is supposed to looks like this: NK-02-0028. But its typed in like this: NK-020028.

Hence, The ID's won't match when I merge these two data frames. If I had data frames with only 10 observations it wouldn't been such a big problem but I have approx. 8000 observations in one df and 355 in the other. The correct IDs are in the df with 355 obs and the wrong ones are in the df with 8000 obs. I want to match the ID numbers in the df with 355 observations based on the 4 last digits to see how many and what matches I get to see if there even are any matches.

Is this possible? Hopefully someone can help me and understands my problem I got here.

So you need some way of "normalising" the ID numbers? How many mistakes are there? Are they all 2 letters + numbers? If you just stripped out all the "-" chars would everything match okay? How many different ways has the data been incorrectly input? I think we need some more examples. — Spacedman, Mar 07 '16 at 16:35
well, some IDs don't have letters so they look for example like this "00-0354". These seems to be inserted correctly but IDs that have letters, such as "NK-02-0028" are often registered wrong. I found several different examples of mistakes like NK-020028 (without a dash) or N-K02-0028 (the dash is in the wrong place. Sometimes there is a 0 instead of a dash! But the 4 last digits seems to be correct so if I can match the df with correct IDs with the df with incorrect IDs based on the last 4 digits to se what matches I get? — Rbeginner, Mar 07 '16 at 16:46

score 1 · Answer 1 · edited May 23 '17 at 12:24

If the problem is only that there are some stripes, than you coud just remove them from both columns and merge them based on these "new" IDs.

For example (which would also remove spaces, which might be practical as well);

df1$ID.new=gsub(' |-', '', df1$ID)
df2$ID.new=gsub(' |-', '', df1$ID)
merge(df1,df1,by="ID.new")

Or after having read your last comment (using the code from here: Extracting the last n characters from a string in R):

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

df1$ID.new=substrRight(df1$ID,4)
df2$ID.new==substrRight(df2$ID,4)
merge(df1,df1,by="ID.new")

Thank you very much! I will try this out! – Rbeginner Mar 07 '16 at 17:25 — Rbeginner, Mar 07 '16 at 17:25

Match values in two different data frames, that are not identical, with matching pattern in R

1 Answers1