Matching columns in 2 data frames when numbers don't exactly match

Question

How do I match two different data frames when the values I am comparing are not exactly the same?

I was thinking of using merge() but I am not sure.

Table1:

ID           Value.1
10001        x
18273-9      y
12824/5/6/7  z
10283/5/9    d

Table2:

ID           Value.2
10001        a
18274        b
12826        c
10289        u

How do I merge Table 1 and 2 based on ID?

Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?

Check out the [`fuzzyjoin`](https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf) package. — Humpelstielzchen, Jul 17 '19 at 09:41
Possible duplicate of [Joining two datasets using fuzzy logic](https://stackoverflow.com/questions/44359860/joining-two-datasets-using-fuzzy-logic) — Humpelstielzchen, Jul 17 '19 at 09:42
What type of join (inner, left, full) are you looking? What's your expected output? — acylam, Jul 17 '19 at 14:46

Patrik_P · Answer 1 · 2019-07-22T08:34:16.817

1

You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:

df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4",  "12824/5/6/7","10283/5/9"), 
                  value=c("a","c","c", "d","k", "l"), stringsAsFactors = F)

df2 <- data.frame(ID=c("10001","18274","12826","10289"), 
                  value=c("o","p","q","r"), stringsAsFactors = F)

doIt <- function(df){
  listAsDF <- function(l) {
    x <- stack(setNames(l, temp$value))
    names(x) <- c("ID", "value")
    return(x)
  }
  Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
  #1 cases when - present
  temp <- df[grep("\\-", df$ID),]
  temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
  Base <- rbind(Base, temp)
  #2 cases when / present
  temp <- df[grep("\\/", df$ID),]
  temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
  Base <- rbind(Base, temp)
  return(Base)
}

Then you can mergge the df2 and df1:

merge(doIt(df1), df2, by = "ID", all.x = T)

Hope this helps!

edited Jul 22 '19 at 08:34

answered Jul 17 '19 at 14:37

Patrik_P

3,066
3
22
39

This has been really helpful! May I know why you use setNames with x and temp? Thank you! – chu-js Jul 18 '19 at 03:31
`setNames` gives names to an object and returns the object itself, this is what I was going for as I need the names further in the stack function. – Patrik_P Jul 18 '19 at 06:40
Do you also mind explaining the arguments in `strtrim`? I googled it and there are only supposed to be two — `strtrim(x, width)`? Thank you so much!! Sorry for bugging you, I'm just starting out. – chu-js Jul 18 '19 at 08:23
`strtrim` is trimming a string to desired width, like you say it takes two args, if you take the code apart, you see, that is exactly what is going on. `x` is `e[1]`/`a[1]` and `width` is `nchar(e[1])-1`/`nchar(a[1])-1` – Patrik_P Jul 18 '19 at 08:35

score 0 · Answer 2 · answered Jul 17 '19 at 11:23

You could use the fuzzy string matching function "agrep" from base R.

df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"), 
                  value=c("a","c","d","k"))

df2 <- data.frame(ID=c("10001","18274","12826","10289"), 
                  value=c("o","p","q","r"))

apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max = 3.5))

As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.

tom · Answer 3 · 2019-07-17T11:40:24.327

One option could consist in extracting the format of ID you want to keep. And then do your merge.

You can format your ID column as follow :

library(stringr) 
library(dplyr)

If you want only the digits before any symbols

Table1 %>% mutate(ID = str_extract("[0-9]*"))

If you want to keep the first sequence of 5 digits

Table1 %>% mutate(ID = str_extract("[0-9]{5}"))

This answers your second question, but does not use the fuzzyjoin package

Matching columns in 2 data frames when numbers don't exactly match

3 Answers3