2

I have two dataframes:

df1

  ua ub a b c
1 11 12 1 1 0
2 12 13 2 2 2
3 13 14 1 1 1
4 14 15 1 1 1
5 15 16 1 1 1
6 16 17 1 1 1

df2

  ua ub a b c d e f
1 11 12 1 1 0 1 1 1
2 14 15 1 1 1 1 1 1
3 16 17 1 1 1 1 1 1
4 12 13 2 2 2 2 2 2

I want to join the two dataframes on ua and ub, and add on the columns in df2 not in df1 onto df1, and "fill in" the missing rows with a 0.

This would give:

  ua ub a b c d e f
1 11 12 1 1 0 1 1 1
2 12 13 2 2 2 2 2 2
3 13 14 1 1 1 0 0 0
4 14 15 1 1 1 1 1 1
5 15 16 1 1 1 0 0 0
6 16 17 1 1 1 1 1 1

I can get the column names in df2 not in df1 by doing:

setdiff(names(df2), names(df1))

But I'm a bit stuck on how to then merge / concat the remaining results.

I looked at this q here but no luck making it work for me.

In Pandas Python, I could use merge combined with .ffill() but I'm unsure how to translate this into R. How can I do this?

Thanks for your help.

Chuck
  • 3,664
  • 7
  • 42
  • 76

2 Answers2

2

Assuming you want to retain all cases of df1 and not df2. Below could be used

library(dplyr)
df3 <- left_join(df1, df2)
df3[is.na(df3)] <- 0

if all cases of df1 as well as df2 is required then replace left_join with full join in above code

Kushdesh
  • 1,118
  • 10
  • 16
1

One possible solution is to use all column names that df1 and df2 have in common:

df1 = read.table(text = "
ua ub a b c
1 11 12 1 1 0
2 12 13 2 2 2
3 13 14 1 1 1
4 14 15 1 1 1
5 15 16 1 1 1
6 16 17 1 1 1
", header=T)

df2 = read.table(text = "
  ua ub a b c d e f
1 11 12 1 1 0 1 1 1
2 14 15 1 1 1 1 1 1
3 16 17 1 1 1 1 1 1
4 12 13 2 2 2 2 2 2
", header=T)

library(dplyr)

df1 %>%
  left_join(df2, by=c("ua", "ub", "a", "b", "c")) %>%
  mutate_all(function(x) ifelse(is.na(x), 0, x))

#   ua ub a b c d e f
# 1 11 12 1 1 0 1 1 1
# 2 12 13 2 2 2 2 2 2
# 3 13 14 1 1 1 0 0 0
# 4 14 15 1 1 1 1 1 1
# 5 15 16 1 1 1 0 0 0
# 6 16 17 1 1 1 1 1 1

Another way to do the above without manually specifying the common columns names is this

common_names = intersect(names(df1), names(df2))

df1 %>%
  left_join(df2, by=common_names) %>%
  mutate_all(function(x) ifelse(is.na(x), 0, x))

Otherwise, you can remove any columns from df2 that you don't want to join and have them twice in your final dataset and use by=c("ua", "ub") when you join:

names_to_use = c("ua", "ub", setdiff(names(df2), names(df1)))
df2_upd = df2[,names_to_use]

df1 %>%
  left_join(df2_upd, by=c("ua", "ub")) %>%
  mutate_all(function(x) ifelse(is.na(x), 0, x))
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • When I try this on my larger df I get `Error in mutate_impl(.data, dots) : Evaluation error: Argument 2 must be type double, not integer.` – Chuck Dec 14 '17 at 14:05
  • Hmmm. That's because of the different variable types and `coalesce`. I'll update my solution. – AntoniosK Dec 14 '17 at 14:07
  • So columns in df1 with the same name as those in df2 had different types even though they had the same name? – Chuck Dec 14 '17 at 14:09
  • More likely that different columns had different types. `mutate_all` will apply the function to all columns and `coalesce` depends on the column types, when `ifelse` doesn't. – AntoniosK Dec 14 '17 at 14:11