Group by with data.table using sum

Question

I have a data frame that I want to group by user and then sum the quantity column for each user.

library(data.table)
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',')

dt = data.table(x)

colnames(dt)
"dates_d" "user" "proj" "quantity"

The quantity column looks like this:

quantity
1
34
12
13
3
12
-
11
1

I have heard that the data.table library is very fast, so I would like to use it.

I have done this in Python but don't know how to do it in R.

Kindly refer to this link: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right for the benchmarking results. – Saurabh Chauhan, Sep 13 '18 at 09:36
You might want to use `read.table(..., stringsAsFactors=FALSE)` then `dt[, sum(quantity), by=.(user)]` – chinsoon12, Sep 13 '18 at 10:15
@chinsoon12 That gives `Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1) ` – user10357467, Sep 13 '18 at 10:19
You can convert to numeric first: `dt[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]` – chinsoon12, Sep 13 '18 at 10:21
This is correct. Make an answer and I'll accept it. Just a few questions: how did the '-' dash become 0? I mean, it is fine, but did stringsAsFactors turn it into NA and then na.rm turn it into 0? Please explain these tricky parts in your answer. Thanks – user10357467, Sep 13 '18 at 10:26

chinsoon12 · Answer 1 · 2018-09-17T00:53:44.537

Due to historical memory limitations, read.table converts strings to factors by default (this was the default before R 4.0.0). When a column contains any non-numeric entry, such as the "-" here, the whole column is read as character and then turned into a factor. Now that RAM is more easily available, you can read the data with stringsAsFactors=FALSE so that the column remains a character vector rather than a factor.

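Then use as.numeric to convert the values to real numbers before summing. Strings that cannot be converted to numbers, such as "-", become NA (with a warning). na.rm=TRUE makes sum ignore the NAs, so a dash contributes nothing to the total, and a user whose only entries are dashes ends up with a sum of 0.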
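To make the conversion step concrete, here is a minimal base R sketch. The values are made up to resemble the quantity column in the question; they are not the real data.

```r
# Toy values resembling the quantity column (invented for illustration)
quantity <- c("1", "34", "12", "-", "11")

# "-" cannot be parsed as a number, so it becomes NA.
# as.numeric() also emits a coercion warning, silenced here.
q_num <- suppressWarnings(as.numeric(quantity))

# Without na.rm the NA propagates into the result
total_with_na <- sum(q_num)               # NA

# With na.rm = TRUE the NA is skipped
total <- sum(q_num, na.rm = TRUE)         # 58

# If every value is a dash, nothing is left to add, and the empty sum is 0
only_dash <- sum(suppressWarnings(as.numeric("-")), na.rm = TRUE)  # 0
```

So the dash is never turned into 0 directly: it becomes NA, and na.rm=TRUE drops it from the sum.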

Taking all of the above:

library(data.table)
#you might want to check out the data.table::fread function to read the data directly as a data.table
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',', stringsAsFactors=FALSE)

setDT(x)[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]
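For anyone without the CSV at hand, the same idea can be tried on a small inline table. This is only a sketch: the user names and quantities are invented, and the result column name total is an arbitrary choice.

```r
library(data.table)

# Hypothetical stand-in for the CSV; users and values are invented
dt <- data.table(
  user     = c("a", "a", "b", "b", "c"),
  quantity = c("1", "34", "-", "11", "-")
)

# Mark the dashes as missing explicitly, so the conversion raises no warning
dt[quantity == "-", quantity := NA_character_]

# Convert the whole column from character to numeric in place
dt[, quantity := as.numeric(quantity)]

# Sum per user; naming the result avoids the default V1 column
res <- dt[, .(total = sum(quantity, na.rm = TRUE)), by = user]
print(res)
```

With these toy values, user a gets 35, user b gets 11, and user c (dashes only) gets 0. When reading with fread instead, its na.strings argument may be worth a look for marking "-" as missing at read time.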

Reference: a useful comment from phiver in "Is there any good reason for columns to be characters instead of factors?", linking to a blog post by Roger Peng: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

Rendy Eza Putra · Answer 2 · 2018-09-13T10:03:30.730

library(dplyr)

dt[dt == "-" ] = NA

df <- dt %>% group_by(user) %>%
        summarise(qty = sum(!is.na(quantity)))

edited Sep 13 '18 at 10:03

answered Sep 13 '18 at 09:29


Evaluation error: sum not meaningful for factors. Maybe the problem is that the column, apart from numbers, also contains a dash (-) when a value is not available for a user. – user10357467 Sep 13 '18 at 09:32
Can you include the data in the question? – Rendy Eza Putra Sep 13 '18 at 09:34
Okay, so the first step is changing the dash value into NA. The rest is the same. – Rendy Eza Putra Sep 13 '18 at 10:04
What this shows is more like a count than a sum. Your result is the same as if I use count in Python; if I use sum, it gives different results. – user10357467 Sep 13 '18 at 10:09
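The difference raised in the last comment can be seen in a small base R sketch. The values are invented and assume the dash has already been replaced by NA and the column converted to numeric.

```r
# Invented values for one user, with the dash already turned into NA
quantity <- c(1, 34, NA, 11)

# !is.na() yields TRUE/FALSE, so summing it counts the non-missing entries
n_present <- sum(!is.na(quantity))         # 3

# Summing the values themselves, skipping NA, gives the actual total
total <- sum(quantity, na.rm = TRUE)       # 46
```

Under these assumptions, using sum(quantity, na.rm = TRUE) inside summarise would return the total rather than the count.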
