Efficiently merging two data frames on a non-trivial criteria

Question

Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame in a for loop, without any success, so I'm curious if there's a better way to go about this problem.

The general case of the problem boils down to this:

Merge two data.frames
Entries in either data.frame can have 0 or more matching entries in the other.
We only care about entries that have 1 or more matches across both.
The match function is complex involving multiple columns in both data.frames

For a concrete example I will use similar data to the linked question:

genes <- data.frame(gene       = letters[1:5], 
                    chromosome = c(2,1,2,1,3),
                    start      = c(100, 100, 500, 350, 321),
                    end        = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
                   chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
                   position   = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))

And our complex matching function:

# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
  return(
    marker$chromosome == gene$chromosome & 
    marker$postion >= (gene$start - 10) &
    marker$postion <= (gene$end + 10)
  )
}

The output should look like an sql INNER JOIN of the two data.frames, for entries where isMatch is TRUE. I've tried to construct the two data.frames so that there can be 0 or more matches in the other data.frame.

The solution I came up with is as follows:

joined <- data.frame()
for (i in 1:nrow(genes)) {
   # This repeated subsetting returns the same results as `isMatch` applied across
   # the `markers` data.frame for each entry in `genes`.
   matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
   matches <- matches[which(matches$pos >= (genes[i, "start"] - 10)),]
   matches <- matches[which(matches$pos <= (genes[i, "end"] + 10)),]
   # matches may now be 0 or more rows, which we want to repeat the gene for:
   if(nrow(matches) != 0) {
     joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
   }
}

Giving the results:

   gene chromosome start end marker position
1     a          2   100 200      3       96
2     a          2   100 200      4      206
3     b          1   100 200      1      105
4     b          1   100 200      5      150
5     b          1   100 200      9      120
51    e          3   321 567      6      400

This is quite an ugly and clungy solution, but anything else I tried was met with failure:

use of apply, gave me a list where each element was a matrix, with no way to rbind them.
I can't specify the dimensions of joined first, because I don't know how many rows I will need in the end.

I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?

When i run your code I get output (`joined`) that doesn't really make sense, can you show what you're expecting for the output from your demo? — alexwhan, Sep 17 '13 at 02:55
Whoops! There was a bug (`>=` should have been `<=` in one instance). Fixed and updated with the output I get. — Scott Ritchie, Sep 17 '13 at 03:05

score 4 · Answer 1 · answered Sep 17 '13 at 03:11

I dealt with a very similar problem myself by doing the merge, and sorting out which rows satisfy the condition afterwards. I don't claim that this is a universal solution, if you're dealing with large datasets where there will be few entries that match the condition, this will likely be inefficient. But to adapt it to your data:

joined.raw <- merge(genes, markers)
joined <- joined.raw[joined.raw$position >= (joined.raw$start -10) & joined.raw$position <= (joined.raw$end + 10),]
joined
#    chromosome gene start end marker position
# 1           1    b   100 200      1      105
# 2           1    b   100 200      5      150
# 4           1    b   100 200      9      120
# 10          2    a   100 200      4      206
# 11          2    a   100 200      3       96
# 16          3    e   321 567      6      400

Much more elegant! But you're right, creating the whole merged `data.frame` is probably not ideal. — Scott Ritchie, Sep 17 '13 at 05:16
Yes, I agree - this is only convenient depending on the data — alexwhan, Sep 17 '13 at 06:31

score 4 · Accepted Answer · edited May 23 '17 at 12:01

A data table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join-on-first-inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than a straight-up merge in this answer.

require(data.table)

genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)

markers <- as.data.table(markers)
setkey(markers, chromosome, position)

new <- genes_start[
    ##join genes to markers
    markers, 
    ##rolling the last key column of genes_start (start_bound) forward
    ##to match the last key column of markers (position)
    roll = Inf, 
    ##inner join
    nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
   # chromosome gene start end marker position
# 1:          1    b   100 200      1      105
# 2:          1    b   100 200      9      120
# 3:          1    b   100 200      5      150
# 4:          2    a   100 200      3       96
# 5:          2    a   100 200      4      206
# 6:          3    e   321 567      6      400

One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.

##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)

## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
    setkey(
        genes_start[
        markers, 
        roll = Inf, 
        nomatch = 0
    ], chromosome, gene, start_bound), 
    roll = -Inf, 
    nomatch = 0
]
setnames(new2, "end_bound", "position")

You absolutely win this. What took 29 minutes with `sqldf` on my full dataset takes just under 2 seconds with `data.table`! And you're right about the second code block -- it's almost twice as slow at just under 4 seconds. — Scott Ritchie, Sep 20 '13 at 01:33
Are the end results identical? I only checked on the test dataset, so I don't know whether it is good for your full dataset. — Blue Magister, Sep 20 '13 at 02:17
I didn't check exactly, but the number of rows came out roughly the same, so I assume so. — Scott Ritchie, Sep 20 '13 at 02:38
Although I may get a speed increase in `sqldf` if I create similar `start_bound` and `end_bound` columns. — Scott Ritchie, Sep 20 '13 at 02:39
+1 I haven't fully followed but can't `roll` be set to `-10` and `+10` directly? — Matt Dowle, Sep 20 '13 at 11:28
@MatthewDowle I attempted that initially, but had some trouble. I believe `roll = -10` sets the match so that `start - 10 <= position <= start`, instead of the desired `start - 10 <= position`. — Blue Magister, Sep 20 '13 at 14:28

Scott Ritchie · Answer 3 · 2013-09-18T01:39:06.737

Another answer I've come up with using the sqldf package.

sqldf("SELECT gene, genes.chromosome, start, end, marker, position 
       FROM genes JOIN markers ON genes.chromosome = markers.chromosome 
       WHERE position >= (start - 10) AND position <= (end + 10)")

Using microbenchmark it performs comparably to @alexwhan's merge and [ method.

> microbenchmark(alexwhan, sql)
Unit: nanoseconds
     expr min    lq median  uq  max neval
 alexwhan 435 462.5  468.0 485 2398   100
      sql 422 456.5  466.5 498 1262   100

I've also attempted to test both functions on some real data of the same format I have lying around (35,000 rows for genes, 2,000,000 rows for markers, with the joined output coming to 480,000 rows).

Unfortunately merge seems unable to handle this much data, falling over at joined.raw <- merge(genes, markers) with an error (which i don't get if reduce the number of rows):

Error in merge.data.frame(genes, markers) : 
  negative length vectors are not allowed

While the sqldf method runs successfully in 29 minutes.

score 0 · Answer 4 · answered Jul 07 '14 at 06:49

After almost one year regarding to this problem you solved for me... now i spent some time to deal with this using another way by awk....

awk 'FNR==NR{a[NR]=$0;next}{for (i in a){split(a[i],x," ");if (x[2]==$2 && x[3]-10 <=$3 && x[4]+10 >=$3)print x[1],x[2],x[3],x[4],$0}}' gene.txt makers.txt > genesnp.txt

which produce the kind of same results:

b   1   100 200 1   1   105
a   2   100 200 3   2   96
a   2   100 200 4   2   206
b   1   100 200 5   1   150
e   3   321 567 6   3   400
b   1   100 200 9   1   120

Efficiently merging two data frames on a non-trivial criteria

4 Answers4

Linked