Matching multiple columns on different data frames and getting other column as result

Question

I have two big data frames. One (df1) has this structure:

   chr    init
1  12  25289552
2   3 180418785
3   3 180434779

The other (df2) has this

    V1    V2     V3
10  1     69094 medium
11  1     69094 medium
12  12 25289552 high
13  1     69095 medium
14  3 180418785 medium
15  3 180434779 low

What I'm trying to do is add column V3 of df2 to df1, so that each position gets its mutation info:

   chr    init  Mut
1  12  25289552 high
2   3 180418785 medium
3   3 180434779 low

I tried loading both into R and then running a for loop with match, but it doesn't work. Do you know a good way to do this? I am also open to using awk or something similar.

score 20 · Accepted Answer · answered Nov 08 '12 at 10:40


Use merge

df1 <- read.table(text='  chr    init
1  12  25289552
2   3 180418785
3   3 180434779', header=TRUE)


df2 <- read.table(text='    V1    V2     V3
10  1     69094 medium
11  1     69094 medium
12  12 25289552 high
13  1     69095 medium
14  3 180418785 medium
15  3 180434779 low', header=TRUE)


merge(df1, df2, by.x='init', by.y='V2') # this works!
       init chr V1     V3
1  25289552  12 12   high
2 180418785   3  3 medium
3 180434779   3  3    low

To get your desired output the way you show it

output <- merge(df1, df2, by.x='init', by.y='V2')[, c(2,1,4)]
colnames(output)[3] <- 'Mut' 
output
  chr      init    Mut
1  12  25289552   high
2   3 180418785 medium
3   3 180434779    low

answered Nov 08 '12 at 10:40

Jilber Urbina



Yes, this is what I want. The point is that I also have to take the chromosome into account, so maybe something like `merge(df1, df2, by.x=c('chr','init'), by.y=c('V1','V2'))` – user976991 Nov 08 '12 at 11:02

Exactly, just adding `chr` and `V1` to the arguments will take them into account :D Consider upvoting the useful answers and accepting one of them if you find it useful :D – Jilber Urbina Nov 08 '12 at 11:13
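
Following the comments above, here is a minimal sketch of the two-column merge. It assumes the example data from the question (row names omitted for brevity) and uses `all.x = TRUE`, which is an addition here, so that `df1` rows without a match would be kept with `NA`. With both key columns in `by.x`/`by.y`, the result has only three columns, so no column reindexing is needed.

```r
df1 <- read.table(text = "chr init
12 25289552
3 180418785
3 180434779", header = TRUE)

df2 <- read.table(text = "V1 V2 V3
1 69094 medium
1 69094 medium
12 25289552 high
1 69095 medium
3 180418785 medium
3 180434779 low", header = TRUE)

# Match on chromosome AND position; all.x = TRUE keeps unmatched df1 rows
output <- merge(df1, df2,
                by.x = c("chr", "init"),
                by.y = c("V1", "V2"),
                all.x = TRUE)

# Rename the joined column by name rather than by position
colnames(output)[colnames(output) == "V3"] <- "Mut"
output
```

Note that `merge` sorts the result by the key columns by default (`sort = TRUE`), so the row order may differ from `df1`.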

score 3 · Answer 2 · answered Nov 08 '12 at 10:40

df1 <- read.table(textConnection("   chr    init
 1  12  25289552
 2   3 180418785
 3   3 180434779"), header=T)

df2 <- read.table(textConnection("    V1    V2     V3
 10  1     69094 medium
 11  1     69094 medium
 12  12 25289552 high
 13  1     69095 medium
 14  3 180418785 medium
 15  3 180434779 low"), header=T)

# Select the values of df2$V3 whose V2 also occurs in df1$init.
# This assumes every df1$init value occurs exactly once in df2$V2,
# and in the same order as in df1.
df1$Mut <- df2$V3[ df2$V2 %in% df1$init]

df1
  chr      init    Mut
1  12  25289552   high
2   3 180418785 medium
3   3 180434779    low

Note that this will not work if there are any key values in `df1` that are absent from `df2`. You would get an error like "replacement has 3 rows, data has 4". See http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right/38130460#38130460 for a left join implementation using `match()`. — bgoldst, Jun 30 '16 at 18:18
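
As a rough sketch of the `match()` idea mentioned in the note above (not the linked answer's implementation), both columns can be combined into one key per row. The fourth row of `df1` below is a made-up position added only to show what happens when a key is absent from `df2`; `key1` and `key2` are illustrative names.

```r
df1 <- read.table(text = "chr init
12 25289552
3 180418785
3 180434779
7 100", header = TRUE)  # last row is hypothetical: no counterpart in df2

df2 <- read.table(text = "V1 V2 V3
1 69094 medium
1 69094 medium
12 25289552 high
1 69095 medium
3 180418785 medium
3 180434779 low", header = TRUE)

# One key per row from both columns; the separator keeps keys unambiguous
key1 <- paste(df1$chr, df1$init, sep = "_")
key2 <- paste(df2$V1, df2$V2, sep = "_")

# match() gives, for each df1 row, the position of its first match in df2
# (NA when there is none), so the result always has nrow(df1) entries
df1$Mut <- df2$V3[match(key1, key2)]
df1
```

Unlike the `%in%` version, this keeps the order of `df1` and leaves `NA` for unmatched rows instead of raising an error.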

score 0 · Answer 3 · answered Nov 08 '12 at 10:38


Does

df3 <- merge( df1, df2, by.x = "init", by.y = "V2" )
df3 <- df3[-3]
colnames( df3 )[3] <- "Mut"

give you what you want?

answered Nov 08 '12 at 10:38

vaettchen


score 0 · Answer 4 · answered Jan 12 '19 at 18:05

@user976991's comment worked for me.

Same idea, but I needed to match on two columns.

My context is a product database with multiple entries per product (potentially price entries). I want to drop the older update_nums and keep only the most recent one for each product_id.

library(data.table)
raw_data <- data.table(product_id = sample(10:13, 20, TRUE), update_num = sample(1:3, 20, TRUE), stuff = rep(1, 20))
# Latest update_num per product; the unnamed aggregate column is called V1
max_update_nums <- raw_data[, max(update_num), by = product_id]
# Match on both columns, then drop duplicate rows
dplyr::distinct(merge(raw_data, max_update_nums, by.x = c("product_id", "update_num"), by.y = c("product_id", "V1")))

score 0 · Answer 5 · answered Jun 16 '21 at 20:52

When I hit this problem recently, I ended up creating a new column in each dataset and using it as a single column to join on.

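library(dplyr)  # provides %>%, left_join() and select()
# create new column for join; the separator avoids ambiguous keys (e.g. 1 + 23 vs 12 + 3)
df1$id <- paste(df1$chr, df1$init, sep = "_")
df2$id <- paste(df2$V1, df2$V2, sep = "_")
# join and select outputs
df3 <- dplyr::left_join(x = df1, y = df2, by = "id") %>%
  dplyr::select(chr, init, V3)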

It did the trick for me.
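
A possible alternative that skips the helper column: `dplyr::left_join` accepts a named character vector in `by`, which maps differently named key columns onto each other. This is a sketch on the question's example data (row names omitted), with `Mut` chosen as the output name to mirror the question.

```r
library(dplyr)

df1 <- read.table(text = "chr init
12 25289552
3 180418785
3 180434779", header = TRUE)

df2 <- read.table(text = "V1 V2 V3
1 69094 medium
1 69094 medium
12 25289552 high
1 69095 medium
3 180418785 medium
3 180434779 low", header = TRUE)

# Join on both columns at once: chr matches V1, init matches V2
df3 <- left_join(df1, df2, by = c("chr" = "V1", "init" = "V2")) %>%
  select(chr, init, Mut = V3)  # rename V3 to Mut while selecting
df3
```

If `df2` held several rows for the same chromosome and position, the left join would return one output row per match, so deduplicating `df2` first may be worthwhile.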
