I have a data frame that I want to group by user, summing the quantity column for each user.

library(data.table)
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',')

dt = data.table(x)

colnames(dt)
"dates_d" "user" "proj" "quantity"   

The quantity column looks like this:

quantity
1
34
12
13
3
12
-
11
1

I have heard that the data.table library is very fast, so I would like to use it.

I have done this in Python but don't know how to do it in R.

  • Kindly refer this link: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right for the benchmarking results. – Saurabh Chauhan Sep 13 '18 at 09:36
  • You might want to use `read.table(..., stringsAsFactors=FALSE)` then `dt[, sum(quantity), by=.(user)]` – chinsoon12 Sep 13 '18 at 10:15
  • @chinsoon12 gives `Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1) ` – user10357467 Sep 13 '18 at 10:19
  • you can convert into numeric first. `dt[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]` – chinsoon12 Sep 13 '18 at 10:21
  • This is correct. Make it an answer and I'll accept it. Just a few questions: how did the '-' dash become 0? I mean, it is fine, but did stringsAsFactors turn it into NA and then na.rm turn it into 0? Please explain these tricky parts in your answer. Thanks – user10357467 Sep 13 '18 at 10:26

2 Answers

1

Historically, to save memory, R read string columns in as factors by default (stringsAsFactors = TRUE was the default for read.table before R 4.0.0). When a column contains a non-numeric entry such as "-", the whole column is treated as text and therefore ends up as a factor. With RAM more easily available now, you can read the data in as strings so that the column stays a character vector rather than a factor.
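As a small illustration of the difference (the inline CSV text is made-up sample data, not the asker's file), reading the same input both ways shows the column type changing:

```r
# Hypothetical sample data standing in for the real CSV file
csv_text <- "user,quantity\na,1\na,-\nb,3"

# Same input, read once with strings as factors and once as plain strings
as_factor <- read.table(text = csv_text, sep = ",", header = TRUE,
                        stringsAsFactors = TRUE)
as_string <- read.table(text = csv_text, sep = ",", header = TRUE,
                        stringsAsFactors = FALSE)

class(as_factor$quantity)  # "factor"
class(as_string$quantity)  # "character"
```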

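Then use as.numeric to convert the strings into real-valued numbers before summing. Strings that cannot be converted into numbers, such as "-", become NA (with a warning). na.rm=TRUE makes sum skip those NAs, so the "-" rows contribute nothing to the total, which is why they look as if they had been turned into 0.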
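A minimal sketch of that conversion on a made-up vector (the values are illustrative only; suppressWarnings just hides the "NAs introduced by coercion" warning):

```r
# Hypothetical character values, including the "-" placeholder
quantity <- c("1", "34", "-", "11")

# "-" cannot be parsed as a number, so it becomes NA
nums <- suppressWarnings(as.numeric(quantity))

is.na(nums)              # FALSE FALSE TRUE FALSE
sum(nums)                # NA, because one element is missing
sum(nums, na.rm = TRUE)  # 46, the NA is skipped
```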

Taking all of the above:

library(data.table)
#you might want to check out the data.table::fread function to read the data directly as a data.table
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',', stringsAsFactors=FALSE)

setDT(x)[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]
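To try the grouped sum without the original file, here is a sketch on a small made-up table (the users, values, and the column name `total` are assumptions for illustration):

```r
library(data.table)

# Hypothetical sample data; quantity is character because of the "-" entry
dt <- data.table(
  user     = c("a", "a", "a", "b", "b"),
  quantity = c("1", "-", "12", "3", "11")
)

# Convert inside the grouped expression and give the result column a name
totals <- dt[, .(total = suppressWarnings(sum(as.numeric(quantity), na.rm = TRUE))),
             by = user]
totals
```

Naming the result with `.(total = ...)` avoids the default `V1` column name that the unnamed form produces.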

Reference: a useful comment from phiver in "Is there any good reason for columns to be characters instead of factors?", linking to a blog post by Roger Peng: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

chinsoon12
0
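library(dplyr)

# Treat the "-" placeholder as missing
dt[dt == "-" ] = NA

# as.character() guards against quantity having been read in as a factor;
# convert to numeric, then sum while skipping NA
df <- dt %>% group_by(user) %>%
        summarise(qty = sum(as.numeric(as.character(quantity)), na.rm = TRUE))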