How to merge two data frames on common columns in R with sum of others?

Question

R Version 2.11.1 32-bit on Windows 7

I got two data sets: data_A and data_B:

data_A

USER_A USER_B ACTION
1      11     0.3
1      13     0.25
1      16     0.63
1      17     0.26
2      11     0.14
2      14     0.28

data_B

USER_A USER_B ACTION
1      13     0.17
1      14     0.27
2      11     0.25

Now I want to add the ACTION of data_B to the data_A if their USER_A and USER_B are equal. As the example above, the result would be:

data_A

USER_A USER_B ACTION
1      11     0.3
1      13     0.25+0.17
1      16     0.63
1      17     0.26
2      11     0.14+0.25
2      14     0.28

So how could I achieve it?

Normally, I'd use awk or perl for this. Would such a solution be ok? — Peter G., Apr 24 '11 at 09:14
Possible duplicate of [How to merge and sum two data frames](https://stackoverflow.com/questions/38468502/how-to-merge-and-sum-two-data-frames) — divibisan, May 14 '19 at 17:40
While the above linked question is newer, it has more a number of nicer, more current approaches. — divibisan, May 14 '19 at 17:42

Andrie · Accepted Answer · 2011-04-24T15:07:55.107

19

You can use ddply in package plyr and combine it with merge:

library(plyr)
ddply(merge(data_A, data_B, all.x=TRUE), 
  .(USER_A, USER_B), summarise, ACTION=sum(ACTION))

Notice that merge is called with the parameter all.x=TRUE - this returns all of the values in the first data.frame passed to merge, i.e. data_A:

  USER_A USER_B ACTION
1      1     11   0.30
2      1     13   0.25
3      1     16   0.63
4      1     17   0.26
5      2     11   0.14
6      2     14   0.28

edited Apr 24 '11 at 15:07

answered Apr 24 '11 at 12:33

Andrie

176,377
47
447
496

3

That output isn't quite what the OP had - notice you have one extra row than the OP wanted. We need `data_A` but with an update to two of the `ACTION` entries. The equivalent base R of your answer would be: `aggregate(ACTION ~ USER_B + USER_A, data = rbind(data_A, data_B), FUN = sum)[, c(2,1,3)]` but I discounted this because it wasn't an update of `data_A`. – Gavin Simpson Apr 24 '11 at 13:12
1

@GavinSimpson Thank you for spotting this. I have now modified the code to use merge, rather than rbind. – Andrie Apr 24 '11 at 13:19
Dies quick on two sets of 3M rows each after consuming 2Gb of ram. – Artem Oboturov Dec 30 '12 at 23:10
1

@ArtemOboturov If you want faster processing with lower memory consumption, try the data.table package – Andrie Dec 30 '12 at 23:15
2

am I missing something? OP asked the action column value to be summed, but this answer doesn't? – kevin0228ca Aug 16 '17 at 05:54
@kevin0228ca If I replace the "merge" with rbind I get something that works. – jwimberley May 03 '18 at 18:50

score 16 · Answer 2 · answered Apr 24 '11 at 11:13

This sort of thing is quite easy to do with a database-like operation. Here I use package sqldf to do a left (outer) join and then summarise the resulting object:

require(sqldf)
tmp <- sqldf("select * from data_A left join data_B using (USER_A, USER_B)")

This results in:

> tmp
  USER_A USER_B ACTION ACTION
1      1     11   0.30     NA
2      1     13   0.25   0.17
3      1     16   0.63     NA
4      1     17   0.26     NA
5      2     11   0.14   0.25
6      2     14   0.28     NA

Now we just need sum the two ACTION columns:

data_C <- transform(data_A, ACTION = rowSums(tmp[, 3:4], na.rm = TRUE))

Which gives the desired result:

> data_C
  USER_A USER_B ACTION
1      1     11   0.30
2      1     13   0.42
3      1     16   0.63
4      1     17   0.26
5      2     11   0.39
6      2     14   0.28

This can be done using standard R function merge:

> merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)
  USER_A USER_B ACTION.x ACTION.y
1      1     11     0.30       NA
2      1     13     0.25     0.17
3      1     16     0.63       NA
4      1     17     0.26       NA
5      2     11     0.14     0.25
6      2     14     0.28       NA

So we can replace the sqldf() call above with:

tmp <- merge(data_A, data_B, by = c("USER_A","USER_B"), all.x = TRUE)

whilst the second line using transform() remains the same.

You could simply add these in the SQL query, and then there's no need for the `transform`. *e.g.* `SELECT A.USER_A, B.USER_B, A.ACTION + B.ACTION AS ACTION FROM data_A A INNER JOIN data_B B ON A.USER_A = B.USER_B` — Matt, Sep 07 '17 at 01:31
This answer works while I tried a lot with accepted answer without achieving...+1 — MysteryGuy, Apr 09 '18 at 13:00

moodymudskipper · Answer 3 · 2022-04-14T10:52:26.450

3

We can use {powerjoin}:

library(powerjoin)
power_left_join(
  data_A,  data_B, by = c("USER_A", "USER_B"), 
  conflict = ~ .x + ifelse(is.na(.y), 0, .y)
)
#>   USER_A USER_B ACTION
#> 1      1     11   0.30
#> 2      1     13   0.42
#> 3      1     16   0.63
#> 4      1     17   0.26
#> 5      2     11   0.39
#> 6      2     14   0.28

In case of conflict, the function fed to the conflict argument will be used on pairs of conflicting columns.

We can also use sum(, na.rm = TRUE) row-wise for the same effect :

power_left_join(data_A,data_B, by = c("USER_A", "USER_B"), 
                conflict = rw ~ sum(.x, .y, na.rm = TRUE))

edited Apr 14 '22 at 10:52

answered Mar 02 '19 at 23:36

moodymudskipper

46,417
11
121
167

1

I really like this nice option, thanks. Also possible in case of incomparable 'by' to sum nicely: ~ifelse(is.na(.x),0,.x)+ifelse(is.na(.y),0,.y) – Gildas Jan 24 '20 at 14:40

How to merge two data frames on common columns in R with sum of others?

data_A

data_B

data_A

3 Answers3

Linked

Related