Working on 2 mixed data frames

Question

I have two dataframes:

> # Create dataframe1
> ID <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7')
> hr <- c(56,32,38,NA,42,23,35)
> cr <- c(1,4,5,2,2,5,4)
> data1 <- data.frame(ID,hr,cr)
> data1
   ID hr cr
1 ID1 56  1
2 ID2 32  4
3 ID3 38  5
4 ID4 NA  2
5 ID5 42  2
6 ID6 23  5
7 ID7 35  4
> 
> # Create dataframe2
> ID <- c('ID1','ID2','ID5','ID7','ID9','ID6','ID7')
> hr <- c(23,13,15,49,22,24,23)
> cr <- c(1,4,5,2,2,5,4)
> data2 <- data.frame(ID,hr,cr)
> data2
  ID hr cr
1 ID1  23   1
2 ID2  13   4
3 ID5  15   5
4 ID7  49   2
5 ID9  22   2
6 ID6  24   5
7 ID7  23   4

If the values in columns ID and cr match in both dataframes, I want to paste dataframe2$hr column onto dataframe 1 (otherwise paste NA), ie. if dataframe1$ID == dataframe2$ID & dataframe1$cr == dataframe2$cr, then paste the value of dataframe2$hr into a new column in dataframe1 and if they don't match, paste NA.

What is the best way of doing this? I tried this way:

data1$newhr=ifelse(data1$ID%in%data2$ID2 & data1$cr%in%data2$cr2,data2$hr2,"NA")

which works for this small example, however for my larger dataframes (which vary in dimensions) this does not work. I get the following error message:

Error in $<-.data.frame(*tmp*, "newP", value = c(NA, NA, NA, NA, NA, : replacement has 563 rows, data has 48

In addition, a warning message in gwascat_highlight$V2 %in% testfull3$gcSNP & gwascat_highlight$V3 %in% :

longer object length is not a multiple of shorter object length

I cannot recreate my larger data frames, however is there an alternative method of carrying out this task that I could try?

Please share the expected output as well. – Ronak Shah Apr 05 '16 at 14:17 — Ronak Shah, Apr 05 '16 at 14:17

Marcelo Bielsa · Answer 1 · 2016-04-05T14:39:06.880

2

I am assuming that the rows in the dataframes must be in the same order. Else you must perform the check line by line or reorder using merge as suggested by M.D., e.g., t2 <- merge(data1, data2, by = c("ID", "cr"), all.x = TRUE) then comparing the merge against data1).

Split your problem in two:

if (compare_dfs(data1,data2)){
  data1$newhr <- data2$hr
} else {
  data1$newhr <-NA
}

then you need to define the comparison method:

 all(data1[,c("ID","cr")] == data2[,c("ID","cr")])

will work for dfs with the same dimension. You can also use

identical(data1[,c("ID","cr")],data2[,c("ID","cr")])

which will work regardless of dimensions.

edited Apr 05 '16 at 14:39

answered Apr 05 '16 at 14:31

Marcelo Bielsa

142
11

What would the outcome look like if the rows were not in the same order? I can order both by value of "ID", however the second data frame will have other ID values not present in the first. – IcedCoffee Apr 05 '16 at 14:35
But the premise is pretty clear: "If the values in columns ID and cr match in both dataframes,...". In any case, my answer clearly states that this works in case the orders of the "ID" column match. Else compare data1[,c("ID","cr") ] against data3[,c("ID","cr")] where data3<-merge(data1, data2, by = c("ID", "cr"), all.x = TRUE); names(dat)[names(dat)=="hr.y"]<-"hr" – Marcelo Bielsa Apr 05 '16 at 14:40

score 1 · Answer 2 · answered Apr 05 '16 at 14:23

1

With library dplyr, you can use the inner_join as follows:

library(dplyr)
inner_join(data1, data2, by = c('ID', 'cr'))

Output is:

   ID hr.x cr hr.y
1 ID1   56  1   23
2 ID2   32  4   13
3 ID6   23  5   24
4 ID7   35  4   23

answered Apr 05 '16 at 14:23

Gopala

10,363
7
45
77

score 1 · Accepted Answer · answered Apr 05 '16 at 14:23

Is the following what you want?

merge(data1, data2, by = c("ID", "cr"), all.x = TRUE)

Result:

   ID cr hr.x hr.y
1 ID1  1   56   23
2 ID2  4   32   13
3 ID3  5   38   NA
4 ID4  2   NA   NA
5 ID5  2   42   NA
6 ID6  5   23   24
7 ID7  4   35   23

Same can be achieved with dplyr, data.table and so on.

Working on 2 mixed data frames

3 Answers3