Answering this question last night, I spent a good hour trying to find a solution that didn't grow a data.frame
in a for loop, without any success, so I'm curious if there's a better way to go about this problem.
The general case of the problem boils down to this:
- Merge two
data.frames
- Entries in either
data.frame
can have 0 or more matching entries in the other. - We only care about entries that have 1 or more matches across both.
- The match function is complex involving multiple columns in both
data.frame
s
For a concrete example I will use similar data to the linked question:
genes <- data.frame(gene = letters[1:5],
chromosome = c(2,1,2,1,3),
start = c(100, 100, 500, 350, 321),
end = c(200, 200, 600, 400, 567))
markers <- data.frame(marker = 1:10,
chromosome = c(1, 1, 2, 2, 1, 3, 4, 3, 1, 2),
position = c(105, 300, 96, 206, 150, 400, 25, 300, 120, 700))
And our complex matching function:
# matching criteria, applies to a single entry from each data.frame
isMatch <- function(marker, gene) {
return(
marker$chromosome == gene$chromosome &
marker$postion >= (gene$start - 10) &
marker$postion <= (gene$end + 10)
)
}
The output should look like an sql
INNER JOIN
of the two data.frames, for entries where isMatch
is TRUE
.
I've tried to construct the two data.frames
so that there can be 0 or more matches in the other data.frame
.
The solution I came up with is as follows:
joined <- data.frame()
for (i in 1:nrow(genes)) {
# This repeated subsetting returns the same results as `isMatch` applied across
# the `markers` data.frame for each entry in `genes`.
matches <- markers[which(markers$chromosome == genes[i, "chromosome"]),]
matches <- matches[which(matches$pos >= (genes[i, "start"] - 10)),]
matches <- matches[which(matches$pos <= (genes[i, "end"] + 10)),]
# matches may now be 0 or more rows, which we want to repeat the gene for:
if(nrow(matches) != 0) {
joined <- rbind(joined, cbind(genes[i,], matches[,c("marker", "position")]))
}
}
Giving the results:
gene chromosome start end marker position
1 a 2 100 200 3 96
2 a 2 100 200 4 206
3 b 1 100 200 1 105
4 b 1 100 200 5 150
5 b 1 100 200 9 120
51 e 3 321 567 6 400
This is quite an ugly and clungy solution, but anything else I tried was met with failure:
- use of
apply
, gave me alist
where each element was a matrix, with no way torbind
them. - I can't specify the dimensions of
joined
first, because I don't know how many rows I will need in the end.
I'm sure I will come up with a problem of this general form in the future. So what's the correct way to solve this kind of problem?