1

I have two dataframes:

Dataframe #1

A   B   C   D   E 
2   1   0   5   7       

and

Dataframe #2

C   E   F   G 
3   1   0   9

I would like to combine the two dataframes such that the 1st dataframe serves as the reference and the columns in common are added together

The solution should look like:

A   B   C   D   E 
2   1   3   5   8

Notice the elements of the shared columns (columns C and E) were added and the extra columns in Dataframe #2 (columns F and G) were ignored because they were not found in Dataframe #1. Each Dataframe has a single row.

I have tried accomplishing this with the native functions in R as well as the merge package but I've had no luck. In my application I need to combine thousands of columns many of which are shared between the two dataframes so I'm working on developing a robust solution.

Thanks!

Andrew Brown
  • 333
  • 3
  • 12
  • Possible duplicate of [How to merge two data frames on common columns in R with sum of others?](http://stackoverflow.com/questions/5769320/how-to-merge-two-data-frames-on-common-columns-in-r-with-sum-of-others) – neilfws Mar 19 '17 at 23:37
  • `overlap = intersect(names(df1), names(df2)); df1[overlap] = df1[overlap] + df2[overlap]`. Assumes the number of rows in the data frames are the same. – Gregor Thomas Mar 19 '17 at 23:44

2 Answers2

4

This should work:

overlap = intersect(names(df1), names(df2))
df1[overlap] = df1[overlap] + df2[overlap]

It assumes the data frames have the same number of rows.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

It may be easier to convert from wide to long format first, do the operations and then convert back. For example, using dplyr and tidyr:

library(dplyr)
library(tidyr)
df1 <- data.frame(A = 2, B = 1, C = 0, D = 5, E = 7)
df2 <- data.frame(C = 3, E = 1, F = 0, G = 9)

df1 %>%
  gather(variable, value) %>%
  left_join(gather(df2, variable, value), by = "variable") %>%
  rowwise() %>%
  mutate(sum = sum(value.x, value.y, na.rm = TRUE)) %>%
  select(variable, sum) %>%
  spread(variable, sum)

What this does: puts variables from df1 into 1 column and their values in another; does the same to df2; combines df1 and df2 using df1 as reference; sums the values for each variable; selects the variable column and the new values (sums); converts back to wide format.

Result:

# A tibble: 1 × 5
      A     B     C     D     E
* <dbl> <dbl> <dbl> <dbl> <dbl>
1     2     1     3     5     8

In general when you have the same measurement for multiple variables in many columns, it's worth considering whether the long format is a better representation of the data.

neilfws
  • 32,751
  • 5
  • 50
  • 63