merge data frames based on non-identical values in R

Question

I have two data frames. First one looks like

dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1",   "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ",  "chr15:35086890..35086919", "2")

where chr15:35086890..35086919 indicates all the numbers within this range.

The second looks like:

dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT","     FUNC")
dat2[1,] <- c("chr1:116242719",   "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855",  "A/G", "intergenic")

I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR in more than one range in dat$Pos, I want it merged each time). What's the easiest way to do this?

I would use some regular expressions and add new columns. `dat2$VAR_fix <- as.integer(gsub('chr1:', '', dat2$VAR))`. And I'd split the dat$Pos into its lower and upper bound as two columns. Then you can do your checks with actual numbers. — Justin, Dec 04 '12 at 00:44
is it possible for a number in `dat2` to appear in more than one row of `dat`? — Ricardo Saporta, Dec 04 '12 at 00:59
Is it possible to un-summarize `dat`, i.e. convert it so it has one number per row instead of ranges (we could provide the algorithm for it), or will it result in too many possibilities? — flodel, Dec 04 '12 at 01:12
I could split like that Justin, but I wanted to see if there was an easier way. Yes, dat2 numbers may repeat, I did mention that. I could unsummarize dat, but it would lose some of its meaning. And, dat is 865*12 and dat2 is 5553*10. — cianius, Dec 04 '12 at 01:19

Ricardo Saporta · Answer 1 · 2012-12-04T02:15:40.047

1

Please try this out and let us know how it works. Without a larger data set it is a bit hard to trouble shoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match)

SPLICE THE DATA

range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))

mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])

d2.vars <- as.numeric(do.call(rbind, str_split(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)

FIND THE MATCHES

# row numebr is the row in dat
# col number is the row in dat2 
matches <- sapply(d2.vars, function(v)  mins < v & v <= maxs)

MERGE

# create a column in dat to merge-by
dat <- cbind(dat, VAR=NA)

# use the VAR in dat2 as the merge id
sapply(seq(ncol(matches)), function(i)
    dat$VAR <- dat2[i, "VAR"] )

merge(dat, dat2)

edited Dec 04 '12 at 02:15

answered Dec 04 '12 at 01:03

Ricardo Saporta

54,400
17
144
178

I would like all the columns in the matching rows to be placed on the same row. So dat2 added to dat1 in matching rows, as extra columns. – cianius Dec 04 '12 at 01:07
I think you are missing the point that here, the `x..y` notation means the whole `[x, y]` range, so using `grep` is not enough. I even think that `chr1` (a chromosome number ?) needs to be accounted for when doing the merge. – flodel Dec 04 '12 at 01:21
I understood it as a long string... But what you are saying is that it should be x < VAR < y ? – Ricardo Saporta Dec 04 '12 at 01:23
Yes chr is chromosome, so number is to be included. And var can be equal to, or within X and Y. – cianius Dec 04 '12 at 01:33
@pepsi, the part I remain unclear about: can a row in dat2 be matched to more than one row in dat? – Ricardo Saporta Dec 04 '12 at 01:56
When I try this on my data, I end up with an empty data frame that just contains the column names from both data frames. – cianius Dec 04 '12 at 10:51
@pepsimax, posting a larger example set would be very helpful. Also, as @flodel pointed out, please make sure your strings are `character`, not `factor` – Ricardo Saporta Dec 04 '12 at 15:47

score 1 · Accepted Answer · edited May 23 '17 at 12:21

1

Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:

library(plyr)

exploded.dat <- adply(dat, 1, function(x){
    parts <- strsplit(x$Pos, ":")[[1]]
    chr   <- parts[1]
    range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
    start <- range[1]
    end   <- range[2]
    data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})

merge(dat2, exploded.dat, by = "VAR")

If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.

edited May 23 '17 at 12:21

Community

1
1

answered Dec 04 '12 at 01:53

flodel

87,577
21
185
223

I think you may have to escape the dots. – Ricardo Saporta Dec 04 '12 at 02:08
@RicardoSaporta, no, since I am using `strsplit` with `fixed = TRUE`. – flodel Dec 04 '12 at 02:16
flodel, Am I right in saying that strsplit works on matrices, but not data frames? Because when I try this on dat (a data frame), I get Error in strsplit(x$POS, ":") : non-character argument. When I convert to matrix and edit this line to parts <- strsplit(x[,2], ":")[[1]] (because POS is second column), It gives me "Error in x[, 2] : incorrect number of dimensions". – cianius Dec 04 '12 at 10:56
1

no, it must be because `class(dat$POS)` is `factor` or something else, but not `character` as in your example. Try changing its class to `character` or replace `x$Pos` with `as.character(x$Pos)` in my code. – flodel Dec 04 '12 at 11:24
I also managed to solve it by using the IRanges package from Bioconductor, which I sheepishly acknowledge is made for exactly this problem.. always good to learn different ways to solve a problem though. – cianius Dec 05 '12 at 19:01

merge data frames based on non-identical values in R

2 Answers2

SPLICE THE DATA

FIND THE MATCHES

MERGE