2

I searched various join questions and none seemed to quite answer this. I have two dataframes which each have an ID column and several information columns.

df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25), 
                  rep(NA, 25)), phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))

df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))

As you can see, df1 is missing some info that is present in df2, while df2 is only a subset of all the ids, but they both have some similar columns. Is there a way to fill the missing values in df1 based on matching ID's from DF2?

I found a similar question that recommended using merge, but when I tried it, it dropped all the id's that were not present in both dataframes. Plus it required manually dropping duplicate columns and in my real dataset, there will be a large number of these, making doing so cumbersome. Even ignoring that though,

both the recommended solutions:

df1 <- setNames(merge(df1, df2)[-2], names(df1))

and

df1[is.na(df1$color), "color"] <- df2[match(df1$id, df2$id), "color"][which(is.na(df1$color))]

did not work for me, throwing various errors.

An alternate solution I have thought of is using rbind then dropping incomplete cases. The problem is that in my real dataset, while there are shared columns, there are also non-shared columns so I would have to create intermediate objects of just the shared columns, rbind, then drop incomplete cases, then join with the original object to regain the dropped columns. This seems unnecessarily roundabout.

In this example it would look like

df2 = rbind(df1[,colnames(df2)], df2)
df2 = df2[complete.cases(df2),]
df2 = merge(df1[,c("id", "rand.col")], df2, by = "id")

and, in case there are any fully duplicated rows between the two dataframes, I would need to add

df2 = unique(df2)

This solution will work, but it is cumbersome and as the number of columns that are being matched on increase, it gets even worse. Is there a better solution?

-edit- fixed a problem in my example data pointed out by Sathish

-edit2- Expanded example data

df1 = data.frame(id = c(1:100),  wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
wq5 = rnorm(50))

df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
wq5 = rnorm(50))

These dataframe represents the case where there are many columns that have incomplete data and a second dataframe that has all of the missing data. Ideally, we would not need to separately list each each column with wq1 := i.wq1 etc.

C. Denney
  • 577
  • 4
  • 16

1 Answers1

3

If you want to join only by id column, you can remove phase in the on clause of code below.

Also your data in the question has discrepancies, which are corrected in the data posted in this answer.

library('data.table')
setDT(df1)  # make data table by reference
setDT(df2)  # make data table by reference
df1[ i = df2, color := i.color, on = .(id, phase)] # join df1 with df2 by id and phase values, and replace color values of df2 with color values of df1

tail(df1)
#     id color phase   rand.col
# 1:  95 green   gas  1.5868335
# 2:  96 green   gas  0.5584864
# 3:  97 green   gas -1.2765922
# 4:  98 green   gas -0.5732654
# 5:  99 green   gas -1.2246126
# 6: 100 green   gas -0.4734006

one-liner:

setDT(df1)[df2, color := i.color, on = .(id, phase)]

Data:

set.seed(1L)
df1 <- data.frame(id = c(1:100), color = c(rep("blue", 25), rep("red", 25), 
                                           rep(NA, 50)), phase = c(rep("liquid", 50), rep("gas", 50)),
                  rand.col = rnorm(100))

df2 <- data.frame(id = c(51:100), color = rep("green", 50), phase = rep("gas", 50))

EDIT: based on new data posted in the question

Data:

set.seed(1L)
df1 = data.frame(id = c(1:100),  wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
                 wq5 = rnorm(50))
set.seed(2423L)
df2 = data.frame(id = c(51:100), wq2 = rnorm(50), wq3 = rnorm(50), wq4 = rnorm(50), 
                 wq5 = rnorm(50))

Code:

library('data.table')
setDT(df1)[ id == 52, ]
#    id       wq2        wq3        wq4         wq5
# 1: 52 0.1836433 -0.6120264 0.04211587 -0.01855983

setDT(df2)[ id == 52, ]
#    id       wq2       wq3        wq4       wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687

df1[df2, `:=` ( wq2 = i.wq2,
                wq3 = i.wq3,
                wq4 = i.wq4,
                wq5 = i.wq5), on = .(id)]

setDT(df1)[ id == 52, ]
#    id       wq2       wq3        wq4       wq5
# 1: 52 0.3917297 -1.007601 -0.6820783 0.3153687
Sathish
  • 12,453
  • 3
  • 41
  • 59
  • This works perfectly thank you. I haven't worked with the data.table package before so I'll have to spend some time familiarizing myself with it. – C. Denney Feb 14 '18 at 17:14
  • Is there a way to replace multiple columns without listing them individually 1 by 1? In my real dataset, there will be about 15 water quality columns that I want to fill with and I'd rather avoid typing them all out individually. The ideal solution would be to tell it to take all columns from a given dataframe, or to be able to list a group of columns by number maybe. – C. Denney Feb 14 '18 at 19:15
  • Sorry, fixed. I had considered adding it in but decided it wasn't relevant to the question I was asking. It's been removed. – C. Denney Feb 15 '18 at 15:52