9

I am trying to deal with some very messy data. I need to merge two large data frames which contain different kinds of data by the sample ID. The problem is that one table's sample IDs are in many different formats, but most contain the required ID string for matching somewhere in their ID, e.g. sample "1234" in one table has got an ID of "ProjectB(1234)" in the other.

I have made a minimal reproducible example.

a<-data.frame(aID=c("1234","4567","6789","3645"),aInfo=c("blue","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

using merge gets part of the way:

merge(a,b, by.x="aID", by.y="bID", all=TRUE)

       aID     aInfo       bInfo
1     1234      blue        <NA>
2     3645  cerulean        <NA>
3     4567     green       apple
4     6789 goldenrod        kiwi
5   (1234)      <NA>      banana
6    23645      <NA> pomegranate
7 63528973      <NA>      lychee

but the output that would be liked is basically:

        ID     aInfo       bInfo
1     1234      blue      banana
2     3645  cerulean pomegranate
3     4567     green       apple
4     6789 goldenrod        kiwi
5 63528973      <NA>      lychee

I just wondered if there was a way to incorporate grep into this or another R-tastic method?

Thanks in advance

oguz ismail
  • 1
  • 16
  • 47
  • 69
  • 1
    `agrep` is approximate string matching – rawr Jan 16 '14 at 15:05
  • 1
    Examples: `b[ agrep('1234', b$bID), ]` `b[ agrep('3645', b$bID), ]` both return the proper IDs from `b` – rawr Jan 16 '14 at 15:07
  • @rawr You should make that an answer. – Roland Jan 16 '14 at 15:10
  • Thanks, I guess my problem then is how to use this agrep in the merging of the two data frames. – bioinformagician Jan 16 '14 at 15:11
  • 1
    I would probably clean the data first. Keep separate files for the original, clean, and merged data. Keep your code separate. Version control, etc. All are good programming practices :) – rawr Jan 16 '14 at 15:13
  • 1
    In case someone is interested, a similar but slightly different question regarding inexact numeric IDs matching can be found here: [How to join data frames based on condition between 2 columns](https://stackoverflow.com/questions/29927179/how-to-join-data-frames-based-on-condition-between-2-columns). – HarmlessEcon Aug 28 '21 at 13:08

3 Answers3

5

Doing merge on a condition is a little tricky. I don't think you can do it with merge as it is written, so you end up having to write a custom function with by. It is pretty inefficient, but then, so is merge. If you have millions of rows, consider data.table. This is how you would do a "inner join" where only rows that match are returned.

# I slightly modified your data to test multiple matches    
a<-data.frame(aID=c("1234","1234","4567","6789","3645"),aInfo=c("blue","blue2","green","goldenrod","cerulean"))
b<-data.frame(bID=c("4567","(1234)","6789","23645","63528973"), bInfo=c("apple","banana","kiwi","pomegranate","lychee"))

f<-function(x) merge(x,b[agrep(x$aID[1],b$bID),],all=TRUE)
do.call(rbind,by(a,a$aID,f))

#         aID     aInfo    bID       bInfo
# 1234.1 1234      blue (1234)      banana
# 1234.2 1234     blue2 (1234)      banana
# 3645   3645  cerulean  23645 pomegranate
# 4567   4567     green   4567       apple
# 6789   6789 goldenrod   6789        kiwi

Doing a full join is a little trickier. This is one way, that is still inefficient:

f<-function(x,b) {
  matches<-b[agrep(x[1,1],b[,1]),]
  if (nrow(matches)>0) merge(x,matches,all=TRUE)
  # Ugly... but how else to create a data.frame full of NAs?
  else merge(x,b[NA,][1,],all.x=TRUE)
}
d<-do.call(rbind,by(a,a$aID,f,b))
left.over<-!(b$bID %in% d$bID)
rbind(d,do.call(rbind,by(b[left.over,],'bID',f,a))[names(d)])

#         aID     aInfo      bID       bInfo
# 1234.1 1234      blue   (1234)      banana
# 1234.2 1234     blue2   (1234)      banana
# 3645   3645  cerulean    23645 pomegranate
# 4567   4567     green     4567       apple
# 6789   6789 goldenrod     6789        kiwi
# bID    <NA>      <NA> 63528973      lychee
nograpes
  • 18,623
  • 1
  • 44
  • 67
  • Sorry, that works if all you want is a merge with `all.x=TRUE`, not `all=TRUE`. I will update shortly. – nograpes Jan 16 '14 at 15:29
  • I updated the answer to include a full join as well. – nograpes Jan 16 '14 at 15:58
  • In fact, I think I'm halfway to having a general `merge.by.condition` function here. – nograpes Jan 16 '14 at 16:21
  • You mean like `plyr::join`? :) data frame full of NAs `data.frame(matrix(NA, 10, 5))` – rawr Jan 16 '14 at 17:01
  • @rawr No, not like `plyr:::join`, because that does not allow you to join on a *condition* other just the equality of two columns. Indeed, the documentation notest that `plyr:::join` is *less* featureful than `merge` Also, while it is possible to generate a `data.frame` full of `NA` that way, it is necessary to also have the names, so that `rbind` doesn't barf. Also, I would need to dynamically calculate the number of rows and columns, but that wouldn't be that hard. – nograpes Jan 16 '14 at 17:46
  • Post your merge when you perfect it, and I'll be using it :) – rawr Jan 16 '14 at 18:33
  • what about the `match.data.frame` function from `Ecdat` package? – user2380782 Mar 13 '15 at 11:33
3

This is an answer using data.table, inspired by @nograpes.

## Create example tables; I added the sarcoline cases
##   so there would be examples of rows in a but not b
a <- data.table(aID=c("1234","1234","4567","6789","3645","321", "321"),
                aInfo=c("blue","blue2","green","goldenrod","cerulean",
                        "sarcoline","sarcoline2"),
                key="aID")
b <- data.table(bID=c("4567","(1234)","6789","23645","63528973"),
                bInfo=c("apple","banana","kiwi","pomegranate","lychee"),
                key="bID")

## Use agrep to get the rows of b by each aID from a
ab <- a[, b[agrep(aID, bID)], by=.(aID, aInfo)]
ab
##     aID     aInfo    bID       bInfo
## 1: 1234      blue (1234)      banana
## 2: 1234     blue2 (1234)      banana
## 3: 3645  cerulean  23645 pomegranate
## 4: 4567     green   4567       apple
## 5: 6789 goldenrod   6789        kiwi

So far we've only got an inner join, so now let's add the unmatched rows from the original tables:

ab <- rbindlist(list(ab, a[!ab[, unique(aID)]], b[!ab[, unique(bID)]]), fill=TRUE)

These steps are optional and are included to match the output from the OP:

## Update NA values of aID with the value from bID
ab[is.na(aID), aID:=bID]

## Drop the bID column
ab[, bID:=NULL]

Final result

ab
##         aID      aInfo       bInfo
## 1:     1234       blue      banana
## 2:     1234      blue2      banana
## 3:     3645   cerulean pomegranate
## 4:     4567      green       apple
## 5:     6789  goldenrod        kiwi
## 6:      321  sarcoline          NA
## 7:      321 sarcoline2          NA
## 8: 63528973         NA      lychee
dnlbrky
  • 9,396
  • 2
  • 51
  • 64
1

I would clean your bIDs some more before merging. If you know all the weird ways in which the bIDs have been formatted then it should be straightforward to clean them up using gsub().

In your example, to remove the brackets I would do something like

expr <- '\\((.*)\\)'
b$bID <- gsub(expr, replace='\\1', b$bID)

In expr there's a few things going on. Firstly there is .* which is regexp for any character any number of times. Wrapping this in brackets lets gsub know that we want to keep it and can refer to it in the replace expression. In order to use left and right brackets as actually characters we need to escape them with double backslashes. Putting all this together would read as; I want to keep everything between a left bracket and a right bracket.

Note that you can do fancy things with your replace expression such as replace='id_\\1'.

In regards to finding an ID within a number sequence you would have to try substring matching or something, but I dont consider that a good approach.

Hope this helps.

Stuples
  • 1,376
  • 10
  • 8