Distinct in data.table as in dplyr

Question

I am trying to use data.table for better performance, but I don't know how to do the equivalent of dplyr's distinct() %>% summarise(). Any ideas on how I could adapt the following code to data.table?

# df is a placeholder name for the input data frame
df %>%
  group_by(x, y, z) %>%
  distinct(h, .keep_all = TRUE) %>%
  summarise(tot1 = sum(value1), tot2 = sum(value2))

Is this what you're looking for? https://stackoverflow.com/questions/11792527/filtering-out-duplicated-non-unique-rows-in-data-table — Grant, Aug 08 '18 at 16:45
Maybe, but how do I do it with the bracket notation (group, distinct, and sum)? — Fausto Carvalho Marques Silva, Aug 08 '18 at 17:13

score 5 · Accepted Answer · answered Aug 08 '18 at 17:30

You can do the group, distinct, and sum in two steps with data.table. First, call unique() with the by argument set to your grouping variables plus the distinct variable; this keeps the first row for each combination. Then do the data.table equivalent of summarise() with by set to just the grouping variables.

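library(data.table)

# Example data: g1 and g2 are the grouping columns, d4 is the column to
# de-duplicate on within each group, and c3 is the value to sum.
dfq = data.frame(
    g1 = rep(c('a', 'b', 'c'), times = 12),
    g2 = rep(c('d', 'e', 'f', 'g'), times = 9),
    c3 = as.integer(30 * runif(36)),
    d4 = rep(LETTERS[1:18], times = 2)
)

dtq = as.data.table(dfq)

# Step 1: unique() keeps the first row for each (g1, g2, d4) combination.
# Step 2: sum c3 within each (g1, g2) group.
dtq2 = unique(dtq, by = c("g1", "g2", "d4"))[
    , .(sum1 = sum(c3)),
    by = c("g1", "g2")
]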
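
To make the effect of the unique() step easier to see, here is a minimal sketch on a tiny hand-made table that does contain repeated combinations, with two value columns as in the question. The table dt and its columns g, h, v1, and v2 are made up for illustration and are not from the question's data.

```r
library(data.table)

# Hypothetical data: g is the grouping column, h is the column to
# de-duplicate on within each group, v1 and v2 are the values to sum.
dt = data.table(
    g  = c("a", "a", "a", "b", "b"),
    h  = c("p", "p", "q", "p", "p"),
    v1 = c(1, 10, 2, 3, 30),
    v2 = c(5, 50, 6, 7, 70)
)

# Step 1: unique() keeps the first row of each (g, h) combination,
# so rows 2 and 5 are dropped.
# Step 2: sum both value columns within each g.
res = unique(dt, by = c("g", "h"))[
    , .(tot1 = sum(v1), tot2 = sum(v2)),
    by = "g"
]
print(res)
```

Group a ends up with tot1 = 3 and tot2 = 11, and group b with tot1 = 3 and tot2 = 7, because only the first row of each repeated (g, h) pair contributes to the sums.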

Grant

Thanks, it helped. data.table is not as well documented and explained as dplyr. – Fausto Carvalho Marques Silva Aug 08 '18 at 17:50
@Fausto Re documentation, some explanation and examples for aggregation are in `vignette("datatable-intro")`, and I guess they're open to feedback on it (e.g., maybe `unique` belongs there). – Frank Aug 08 '18 at 18:03
