3

How do I match two different data frames when the values I am comparing are not exactly the same?

I was thinking of using merge() but I am not sure.

Table1:

ID           Value.1
10001        x
18273-9      y
12824/5/6/7  z
10283/5/9    d

Table2:

ID           Value.2
10001        a
18274        b
12826        c
10289        u

How do I merge Table 1 and 2 based on ID?

Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?

tom
  • 725
  • 4
  • 17
chu-js
  • 181
  • 7

3 Answers3

1

You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:

df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4",  "12824/5/6/7","10283/5/9"), 
                  value=c("a","c","c", "d","k", "l"), stringsAsFactors = F)

df2 <- data.frame(ID=c("10001","18274","12826","10289"), 
                  value=c("o","p","q","r"), stringsAsFactors = F)

doIt <- function(df){
  listAsDF <- function(l) {
    x <- stack(setNames(l, temp$value))
    names(x) <- c("ID", "value")
    return(x)
  }
  Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
  #1 cases when - present
  temp <- df[grep("\\-", df$ID),]
  temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
  Base <- rbind(Base, temp)
  #2 cases when / present
  temp <- df[grep("\\/", df$ID),]
  temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
  Base <- rbind(Base, temp)
  return(Base)
}

Then you can mergge the df2 and df1:

merge(doIt(df1), df2, by = "ID", all.x = T)

Hope this helps!

Patrik_P
  • 3,066
  • 3
  • 22
  • 39
  • This has been really helpful! May I know why you use setNames with x and temp? Thank you! – chu-js Jul 18 '19 at 03:31
  • `setNames` gives names to an object and returns the object itself, this is what I was going for as I need the names further in the stack function. – Patrik_P Jul 18 '19 at 06:40
  • Do you also mind explaining the arguments in `strtrim`? I googled it and there are only supposed to be two — `strtrim(x, width)`? Thank you so much!! Sorry for bugging you, I'm just starting out. – chu-js Jul 18 '19 at 08:23
  • `strtrim` is trimming a string to desired width, like you say it takes two args, if you take the code apart, you see, that is exactly what is going on. `x` is `e[1]`/`a[1]` and `width` is `nchar(e[1])-1`/`nchar(a[1])-1` – Patrik_P Jul 18 '19 at 08:35
0

You could use the fuzzy string matching function "agrep" from base R.

df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"), 
                  value=c("a","c","d","k"))

df2 <- data.frame(ID=c("10001","18274","12826","10289"), 
                  value=c("o","p","q","r"))

apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max = 3.5))

As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.

Dominix
  • 433
  • 3
  • 9
0

One option could consist in extracting the format of ID you want to keep. And then do your merge.

You can format your ID column as follow :

library(stringr) 
library(dplyr) 

If you want only the digits before any symbols

Table1 %>% mutate(ID = str_extract("[0-9]*")) 

If you want to keep the first sequence of 5 digits

Table1 %>% mutate(ID = str_extract("[0-9]{5}"))

This answers your second question, but does not use the fuzzyjoin package

tom
  • 725
  • 4
  • 17