3

I have some large data sets and am trying out data.table to combine them while summing up the shared column over matching rows. I know how to merge using [ matching rows in the LHS data.table as shown below with tables a2:LHS and a:RHS

a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")

a2
    b   c
 1: a 100
 2: b 100
 3: c 100
 4: d 100
 5: e 100
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

a
    b  c
 1: a  1
 2: b  2
 3: c  3
 4: d  4
 5: e  5
 6: f  6
 7: g  7
 8: h  8
 9: i  9
10: j 10

from second answer hereMerge data frames whilst summing common columns in R I saw how columns could be summed up over matching rows, as such:

setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
    b   c
 1: a 101
 2: b 102
 3: c 103
 4: d 104
 5: e 105
 6: k 100
 7: l 100
 8: m 100
 9: n 100
10: o 100

However I am trying retain the rows that don't match as well.

Alternately I could use merge as shown below but I would like a void making a new table with 4 rows before reducing it to 2 rows.

c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]

c
    b value
 1: a   102
 2: b   104
 3: c   106
 4: d   108
 5: e   110
 6: f     6
 7: g     7
 8: h     8
 9: i     9
10: j    10
11: k   100
12: l   100
13: m   100
14: n   100
15: o   100

This last table is what I would like to achieve, Thanks in Advance.

Community
  • 1
  • 1
Bhail
  • 385
  • 1
  • 2
  • 18

1 Answers1

2

merge is likely to not be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:

rbindlist(list(a, a2))[, sum(c), b]

I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).


Compare some timings on larger datasets:

library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))

system.time(rbindlist(list(a, a2))[, sum(c), b])
#   user  system elapsed 
#   0.83    0.05    0.87 

system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
#   user  system elapsed 
# 159.58    0.48  162.95 

## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166

nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • `rbind(a, a2)` will dispatch to `rbind.data.table`, I guess, so no need to construct a list / pass to rbindlist, I guess. – Frank Mar 22 '17 at 05:22
  • 2
    `rbindlist(list(a, a2))[, sum(c), b]` – Bhail Mar 22 '17 at 08:01
  • In `rbindlist(list(a, a2))[, sum(c), b]` I am guessing here `[, sum(c), b]` is a data.table feature and its passing `sum()` by columns assigned as id or key, another feature of data.table. But I am still a little unsure I will be able to use `[` feature intuitively unless looking at example, I would appreciate help with grasping the language of data.table here. – Bhail Mar 22 '17 at 08:10
  • 1
    @Bhail, `rbindlist(list(a, a2))` just creates a single `data.table` with "a2" appended as rows after "a". You can use `[` directly, rather than first storing that output as a new table and then working on it. The `[, sum(c), b]` is the standard approach you would use with a `data.table` and can be read as "take the sum of variable c grouped by variable b". – A5C1D2H2I1M1N2O1R2T1 Mar 22 '17 at 08:19
  • @A5C1D2H2I1M1N2O1R2T1, Thanks. This a good, I get a taste of why data.table is consider a must have by some avid R users. – Bhail Mar 22 '17 at 08:36