3

I have two data frames:

df1 = data.frame(index=c(0,3,4),n1=c(1,2,3))  
df1
#  index n1
# 1    0  1
# 2    3  2
# 3    4  3

df2 = data.frame(index=c(1,2,3),n2=c(4,5,6))  
df2
#   index n2
# 1     1  4
# 2     2  5
# 3     3  6

I want to join these to:

  index n
1     0 1
2     1 4
3     2 5
4     3 8  (index 3 in two df, so add 2 and 6 in each df)
5     4 3
6     5 0  (index 5 not exists in either df, so set 0)
7     6 0  (index 6 not exists in either df, so set 0)

The given data frames are just part of large dataset. Can I do it using dplyr or other packages in R?

Henrik
  • 65,555
  • 14
  • 143
  • 159
field210
  • 557
  • 5
  • 11

3 Answers3

2

Using data.table (would be efficient for bigger datasets). I am not changing the column names, as the rbindlist uses the name of the first dataset ie. in this case n from the second column (Don't know if it is a feature or bug). Once you join the datasets by rbindlist, group it by column index i.e. (by=index) and do the sum of n column (list(n=sum(n)) )

 library(data.table)
 rbindlist(list(data.frame(index=0:6,n=0), df1,df2))[,list(n=sum(n)), by=index]
     index n
 #1:     0 1
 #2:     1 4
 #3:     2 5
 #4:     3 8
 #5:     4 3
 #6:     5 0
 #7:     6 0

Or using dplyr. Here, the column names of all the datasets should be the same. So, I am changing it before binding the datasets using rbind_list. If the names are different, there will be multiple columns for each name. After joining the datasets, group it by index and then use summarize and do the sum of column n.

 library(dplyr)
 nm1 <- c("index", "n") 
 colnames(df1) <- colnames(df2) <- nm1 
 rbind_list(df1,df2, data.frame(index=0:6, n=0)) %>% 
                                          group_by(index) %>%
                                          summarise(n=sum(n))
akrun
  • 874,273
  • 37
  • 540
  • 662
0

This is something you could do with the base functions aggregate and rbind

df1 = data.frame(index=c(0,3,4),n=c(1,2,3))  
df2 = data.frame(index=c(1,2,3),n=c(4,5,6))  
aggregate(n~index, rbind(df1, df2, data.frame(index=0:6, n=0)), sum)

which returns

  index n
1     0 1
2     1 4
3     2 5
4     3 8
5     4 3
6     5 0
7     6 0
MrFlick
  • 195,160
  • 17
  • 277
  • 295
0

How about

names(df1) <- c("index", "n")                # set colnames of df1 to target
df3 <- rbind(df1,setNames(df2, names(df1)))  # set colnnames of df2 and join
df <- df3 %>% dplyr::arrange(index)          # sort by index

Cheers.

Boern
  • 7,233
  • 5
  • 55
  • 86