Intersect dataframe on multiple columns

Question

I have 2 dataframes:

dat: 1900 obs of 9 variables

              V1        V2        V3        V4 V5      V6       V7     V8             V9
1        V_P50P50_Q3 chr12 106642383 106642395  + 18.1425 4.03e-08 0.0515  GGGGGACTCCCCC
2 V_P50RELAP65_Q5_01  chr8 142276666 142276677  - 16.6429 2.51e-07 0.2780   GGGATTTCCCAC
3          V_RELA_Q6 chr22  51020067  51020078  - 15.9395 2.71e-07 0.3350   GGGAATTTCCCC
4       V_NFKB_Q6_01 chr14  98601454  98601469  + 17.0684 3.08e-07 0.236  GGAGTGGAAATTCC
5          V_CREL_Q6 chr22  51020068  51020079  - 16.1165 3.19e-07 0.4050   AGGGAATTTCCC

dat.markov: 1486 obs of 9 variables

            V1    V2        V3        V4 V5      V6       V7    V8               V9
1 V_NFKB_Q6_01 chr14  98601454  98601469  + 17.2212 1.33e-07 0.146 GGAGTGGAAATTCCCT
2  V_P50P50_Q3 chr12 106642383 106642395  + 16.9358 1.57e-07 0.201    GGGGGACTCCCCC
3    V_CREL_Q6 chr22  51020068  51020079  - 16.0549 2.29e-07 0.292     AGGGAATTTCCC
4 V_NFKB_Q6_01 chr22  51020064  51020079  + 16.9906 2.32e-07 0.146 TTGGGGGAAATTCCCT
5    V_RELA_Q6 chr22  51020067  51020078  - 15.7496 3.42e-07 0.433     GGGAATTTCCCC

I need to merge the two data frames such that I get all the rows with matching columns V1, V2, V3 and V4 between the two data.frames.

I tried:

y<-merge(dat,dat.markov,by=c("V1","V2","V3","V4"))

which gives me a merged dataframe but with 1513 obs. But technically, the number of observations should be equal to or less than the smaller dataframe i.e. 1486 obs.

My merged data.frame looks alright in terms of number of columns returned:

         V1    V2        V3        V4 V5.x    V6.x     V7.x  V8.x       V9.x V5.y    
1 V_CREL_01 chr10 112778464 112778473    + 12.9434 1.94e-05 0.694 TGGGTTTTCC    +  
     V6.y     V7.y  V8.y       V9.y
1 12.8838 2.35e-05 0.788 TGGGTTTTCC

I know you can intersect the data.frames using one column but is there a way in which you can intersect two data.frames on multiple columns?

You're doing this correctly, however, if you have rows where the combination of those three columns is not unique, you can have a very large result! e.g. `merge(data.frame(X=rep(1:4, 2), Y1=LETTERS[1:4]), data.frame(X=1:3, Y2=letters[1:3]))` where the result is longer than the second input. — Justin, Sep 09 '13 at 19:33
I think you are right, I just checked and there are quite a few duplicates in my merged dataframe. So depending on the data, my code would work. Thanks! — Komal Rathi, Sep 09 '13 at 19:51

score 3 · Accepted Answer · answered Sep 09 '13 at 19:44

3

If I understand, you want columns 1,2,3,4 to be perfectly matched in both frames in order to keep them? Besides merge, I would use the interaction and match functions

dat$combine = as.character(interaction(dat$V1, dat$V2, dat$V3, dat$V4))
dat.markov$combine = as.character(interaction(dat.markov$V1, dat.markov$V2, dat.markov$V3, dat.markov$V4))

dat.overlap = dat[complete.cases(match(dat.markov$combine, dat$combine)),]

answered Sep 09 '13 at 19:44

dylanjf

191
1
4

This doesn't bring the additional columns contained in `dat.markov` into the resulting `dat.overlap` like a `merge` would. – Justin Sep 09 '13 at 19:59
true. you could combine that last line with a similar call to dat.markov and flip the `match` arguements, then specify the columns you'd like. really though, `merge` then `unique` on an ID-type variable is the way to go. – dylanjf Sep 09 '13 at 20:17
@dylanjf: I am really having memory problems with the `interaction` function. R just hangs when I try to use it, so I will just have to merge my data. I tried `unique` previously but I dont think I can use it because even though the V1, V2, V3 and V4 columns are same, the corresponding columns are non-unique. – Komal Rathi Sep 09 '13 at 20:34

Intersect dataframe on multiple columns

1 Answers1