
I have recently started exploring R.
I have a CSV with many rows, and my goal is to calculate the average of one column grouped by three other columns.
For small data the code works well, but with a million rows it takes around 10 seconds.

Here is what I have tried.
Used fread to read only the 4 relevant columns from the large CSV (takes around 5 seconds):

mydata <- fread("csv_data.csv", select = c("Col1", "Col2", "Col3", "Time"))

Used aggregate:

aggregate(mydata$Time, by = list(mydata$Col1, mydata$Col2, mydata$Col3), mean, na.rm = TRUE)

The aggregate call takes around 6 seconds to compute and return the result.

I also tried sqldf, but it takes more than 30 seconds.
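
The sqldf attempt looked roughly like this (a sketch, assuming a straightforward GROUP BY query; sqldf runs the SQL against data frames in the current environment):

library(sqldf)
# average Time per unique (Col1, Col2, Col3) combination via SQL
sqldf("SELECT Col1, Col2, Col3, AVG(Time) AS mean_time
       FROM mydata
       GROUP BY Col1, Col2, Col3")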

My goal is to find the average of Time for each unique combination of Col1, Col2 and Col3.

So my question is: how can I calculate the average time for each unique combination of the other three columns efficiently?

A3006
    It is likely that you can use `data.table` to help you improve the speed of computation. It would be great if you can use `dput` to share your example dataset. – www Dec 02 '17 at 14:56
    Hello @A3006. There is no question in your post. Could you please update your post to include a question for which you would like an answer? There are many ways to do a particular thing in R, some of which are less efficient than others. – Len Greski Dec 02 '17 at 15:01

2 Answers


Here is a solution with generated data and timings showing that data.table outperforms aggregate() on 1,000,000 rows of data.

> library(data.table)
data.table 1.10.4.2
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com
> # generate data
> col1 <- rep(1:10,100000)
> col2 <- rep(c(2,4,6,8,10),200000)
> col3 <- c(rep(c("A","B"),250000),rep(c("C","D"),250000))
> rate <- rexp(1000000,rate=.2)
> 
> theData <- data.frame(col1,col2,col3,rate)
> 
> # aggregate with stats::aggregate()
> system.time(aggData1 <- aggregate(rate ~ col1 + col2 + col3,data=theData,mean,na.rm=TRUE))
   user  system elapsed 
   1.71    0.03    1.77 
>
> head(aggData1)
  col1 col2 col3     rate
1    1    2    A 5.021857
2    7    4    A 5.020089
3    3    6    A 4.991095
4    9    8    A 4.963572
5    5   10    A 5.001005
6    6    2    B 5.007399 
> 
> #use data.table
> theTable <- as.data.table(theData)
> 
> system.time(aggData2 <- theTable[,mean(rate,na.rm=TRUE),by=.(col1,col2,col3)])
   user  system elapsed 
   0.03    0.02    0.05 
>
> head(aggData2)
   col1 col2 col3       V1
1:    1    2    A 5.021857
2:    2    4    B 5.000865
3:    3    6    A 4.991095
4:    4    8    B 5.023281
5:    5   10    A 5.001005
6:    6    2    B 5.007399
> 
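
A small refinement, not benchmarked above: keyby sorts the result by the grouping columns (and sets them as the table's key), and the aggregate column can be named in the same call. A sketch reusing theTable from above (aggData3 is an illustrative name):

# name the mean column and sort by the grouping columns in one step
aggData3 <- theTable[, .(rate_mean = mean(rate, na.rm = TRUE)),
                     keyby = .(col1, col2, col3)]
head(aggData3)
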
Len Greski
    Thanks Len, when I checked, aggregate took 14 sec as against data.table, which took 5.4 sec, of which 5 sec was for importing the file. Now I'm checking if I can do something to reduce that time as well. – A3006 Dec 02 '17 at 16:21

Here is a possible solution using data.table. Since you are already using fread, which is a data.table function, data.table seems like a natural choice, and computation speed is one of its main advantages.

library(data.table)
# Convert your data frame to data.table
setDT(mydata)
# Perform the calculation
mydata2 <- mydata[, .(Time_mean = mean(Time, na.rm = TRUE)), by = .(Col1, Col2, Col3)]

Since you did not provide a reproducible example of mydata, I have no way to test the code. Below I use the built-in dataset mtcars as an example.

library(data.table)

mydata <- mtcars

setDT(mydata)

head(mydata)
#     mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# 4: 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# 5: 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# 6: 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Assume I want to calculate the mean mpg grouped by am, gear, and carb, which mimics your task. Below is the code.

mydata2 <- mydata[, .(mpg_mean = mean(mpg, na.rm = TRUE)), by = .(am, gear, carb)]

mydata2
#  am gear carb mpg_mean
#  1:  1    4    4 21.00000
#  2:  1    4    1 29.10000
#  3:  0    3    1 20.33333
#  4:  0    3    2 17.15000
#  5:  0    3    4 12.62000
#  6:  0    4    2 23.60000
#  7:  0    4    4 18.50000
#  8:  0    3    3 16.30000
#  9:  1    4    2 25.90000
# 10:  1    5    2 28.20000
# 11:  1    5    4 15.80000
# 12:  1    5    6 19.70000
# 13:  1    5    8 15.00000
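
As a quick sanity check, base aggregate() on the same grouping reproduces these 13 group means (more slowly, and in a different row order; aggCheck is an illustrative name):

# cross-check the data.table result against base R
aggCheck <- aggregate(mpg ~ am + gear + carb, data = mtcars, FUN = mean)
head(aggCheck)
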
www
    Thanks... Worked like a charm and reduced my computation time. My new total time has come down to 6.2 sec. – A3006 Dec 02 '17 at 15:48