Using the data.table package, which will deliver the fastest user time? (R Programming)

Question

I got this question on a test in an online training course and answered it correctly only by trial and error. I am a beginner in R programming, so I am most likely doing something wrong.

This is the question:

Before you read my R code, take note that for the last part I had to convert all the columns to numeric, because without that I was getting the following message:

"Error in rowMeans(DT) : 'x' must be numeric."

In the test, my professor's solution is: "DT[,mean(pwgtp15), by=SEX]"

According to my R timing code, the fastest option (and so the right answer) is mean(DT$pwgtp15, by=DT$SEX).

I get this output:

My concern is that the way I got DT[,mean(pwgtp15), by=SEX] to work may be making its computation slow.

For that, I used

DT <- data.frame(data.matrix(DT))

Which one is the right answer? The professor's solution? My answer? Another one?

Here is my code:

#THE SOLUTION IS DT[,mean(pwgtp15), by=SEX]
#HOWEVER, my solution is mean(DT$pwgtp15, by=DT$SEX)

install.packages("data.table")

library("data.table")

# Download the data once; each timing loop further down runs its expression 100 times
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv", destfile = "ACS.csv")

DT <- fread("ACS.csv", sep = ",")



counter<- 0
myName<-"DT[,mean(pwgtp15), by=SEX]"
for (i in 1:100)
{
  a<- Sys.time()  
  DT[,mean(pwgtp15), by=SEX]
  b<-Sys.time()
  myTime<-b-a
  counter<- counter + myTime
}
cat("counter is: ", counter, "myName is: ", myName, "\n")



counter<- 0
myName<-"mean(DT[DT$SEX==1,]$pwgtp15);mean(DT[DT$SEX==2,]$pwgtp15)"
for (i in 1:100)
{
  a<- Sys.time()  
  mean(DT[DT$SEX==1,]$pwgtp15); mean(DT[DT$SEX==2,]$pwgtp15)
  b<-Sys.time()
  myTime<-b-a
  counter<- counter + myTime
}
cat("counter is: ", counter, "myName is: ", myName, "\n")



counter<- 0
myName<-"sapply(split(DT$pwgtp15,DT$SEX),mean)"
for (i in 1:100)
{
  a<- Sys.time()  
  sapply(split(DT$pwgtp15,DT$SEX),mean)
  b<-Sys.time()
  myTime<-b-a
  counter<- counter + myTime
}
cat("counter is: ", counter, "myName is: ", myName, "\n")



counter<- 0
myName<-"tapply(DT$pwgtp15, DT$SEX, mean)"
for (i in 1:100)
{
  a<- Sys.time()  
  tapply(DT$pwgtp15, DT$SEX, mean)
  b<-Sys.time()
  myTime<-b-a
  counter<- counter + myTime
}
cat("counter is: ", counter, "myName is: ", myName, "\n")



counter<- 0
myName<-"mean(DT$pwgtp15, by=DT$SEX)"
for (i in 1:100)
{
  a<- Sys.time()  
  mean(DT$pwgtp15, by=DT$SEX)
  b<-Sys.time()
  myTime<- b-a
  counter<- counter + myTime
}
cat("counter is: ", counter, "myName is: ", myName, "\n")



#We convert the entire DATAFRAME to numeric
#Otherwise rowmeans will not work
DT <- data.frame(data.matrix(DT))


counter<- 0
myName<-"rowMeans(DT)[DT$SEX==1];rowMeans(DT)[DT$SEX==2]"

for (i in 1:100)
{
  a<- Sys.time()  
  rowMeans(DT)[DT$SEX==1];rowMeans(DT)[DT$SEX==2]
  b<-Sys.time()
  myTime<- b-a
  counter<- counter + myTime
}
cat("counter is: ", counter, "myName is: ", myName, "\n")

Hmmmm...I cannot get `mean(..., by=...)` to work beyond showing a single `mean` value. *by* argument is entirely ignored. Docs do not show either. Please show `packageVersion("data.table")`. — Parfait, Feb 27 '20 at 18:49
@Parfait Thank you for your help! Did you use? DT <- fread("ACS.csv", sep = ",") — Beginner, Feb 27 '20 at 19:05
@Parfait When I use packageVersion("data.table"), I get ‘1.12.8’ — Beginner, Feb 27 '20 at 19:07
Yes and the return of `mean(DT$pwgtp15, by=DT$SEX)` is one single value not like others split by `SEX`. — Parfait, Feb 27 '20 at 19:41
@Parfait When you use it outside of a loop you get a single mean value. As I am measuring the time that R uses for the calculation, the code registers the time after and before. The difference is the time for one calculation. The process is done 100 times, and those times are added. — Beginner, Feb 27 '20 at 19:42
@ Sorry. I just understood your comment. You are right. Just one value appears. Maybe that is why it is the fastest. — Beginner, Feb 27 '20 at 19:45
Never mind the times. All the methods you are testing should return the same output by themselves. `mean()` does not. The *by* argument is ignored. This is exactly why it is so much faster because it does not split and calculate. Actually, even `rowMeans` errs out for me. BTW - `aggregate(pwgtp15 ~ SEX, DT, mean)` competes with the `DT[...]` call! — Parfait, Feb 27 '20 at 19:47
@Parfait With rowMeans I got an error, but after the conversion to numeric that is used in the code, the error disappears. — Beginner, Feb 27 '20 at 19:49
@Parfait However, after the conversion, the process is the slowest. — Beginner, Feb 27 '20 at 20:08
@Parfait Thank you! I just understood your comment about rowMeans. Thus, the question is just wrong!!! — Beginner, Feb 27 '20 at 20:20
Understood. Reach out to your professor on learned knowledge! Happy coding! — Parfait, Feb 27 '20 at 20:27

score 2 · Accepted Answer · answered Feb 27 '20 at 20:19

As discussed, the answer choices in the question do not all return the same result. base::mean() has no named by parameter. Because its signature accepts further arguments through ..., it does not raise an error when given by; the argument is silently ignored. Since it therefore never splits or subsets by a factor such as DT$SEX, it does the least work and records the fastest time.
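
A minimal sketch of that behaviour, using made-up vectors (weights and sex are hypothetical stand-ins for DT$pwgtp15 and DT$SEX, not the ACS data):

```r
# Synthetic stand-ins for DT$pwgtp15 and DT$SEX (assumed values for illustration)
set.seed(1)
weights <- sample(1:500, 1000, replace = TRUE)
sex     <- sample(1:2, 1000, replace = TRUE)

# `by` is not a formal argument of mean.default(); it lands in `...` and is dropped
names(formals(mean.default))

overall <- mean(weights)
with_by <- mean(weights, by = sex)
identical(overall, with_by)   # TRUE: the grouping was never applied

# A genuinely grouped mean returns one value per level of `sex`
grouped <- tapply(weights, sex, mean)
length(grouped)               # 2
```

So the "fastest" choice wins only because it computes a single overall mean instead of one mean per group.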

Additionally, there are reasons for the other methods returning slower times:

tapply(...), sapply(split(...)), rowMeans(...)

tapply and sapply(split(...)) are apply family members, which are hidden loops and not fully vectorized computations. rowMeans is called twice on the whole table, and each call first coerces the entire data frame to a matrix before averaging across every column of each row. apply is infamous for the same cast of an entire data frame/table to a matrix, so we should heed @DavidArenburg's caveat:

If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).
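
Apart from speed, the rowMeans(...) choice does not compute the same thing as the other options. A small sketch on a made-up data frame (the column other and all values are assumptions for illustration) shows that indexing rowMeans by SEX yields one average per matching row, taken across all columns, rather than the group mean of pwgtp15:

```r
# Tiny hypothetical table; `other` stands in for the remaining numeric columns
df <- data.frame(
  SEX     = c(1, 2, 1, 2),
  pwgtp15 = c(10, 20, 30, 40),
  other   = c(100, 200, 300, 400)
)

# Per-row averages over SEX, pwgtp15 and other, kept for the rows where SEX == 1
row_avgs <- rowMeans(df)[df$SEX == 1]
length(row_avgs)   # 2: one value per matching row

# The quantity the question actually asks for: mean of pwgtp15 within SEX == 1
col_avg <- mean(df$pwgtp15[df$SEX == 1])
col_avg            # 20
```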

mean(...); mean(...)

This makes two calls on subsetted data frames. The logical indexing with [ returns all columns of the data frame, and only then does $ select the single numeric column passed to mean().

In fact, it would be much faster, and perhaps the fastest of all, to subset the vector directly rather than subsetting the data frame, which returns all columns:

mean(DT$pwgtp15[DT$SEX==1]);mean(DT$pwgtp15[DT$SEX==2])

a <- Sys.time() 
DT[,mean(pwgtp15), by=SEX]
b <- Sys.time() 
myTime <- b-a
myTime
# Time difference of 0.01888704 secs
# Time difference of 0.03294992 secs
# Time difference of 0.03321409 secs

a <- Sys.time() 
mean(DT$pwgtp15[DT$SEX==1]);mean(DT$pwgtp15[DT$SEX==2])
b <- Sys.time() 
myTime <- b-a
myTime
# Time difference of 0.006003857 secs
# Time difference of 0 secs
# Time difference of 0 secs
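
Note that Sys.time() differences measure elapsed wall-clock time, while the question asks about user time. A rough sketch of measuring user CPU time with base R's system.time() instead; the synthetic data frame df and the helper user_seconds are assumptions for illustration, not part of the original test, and the data.table comparison only runs if the package is installed:

```r
# Synthetic stand-in for the ACS table (hypothetical values, not the real survey data)
set.seed(42)
n <- 15000
df <- data.frame(
  SEX     = sample(1:2, n, replace = TRUE),
  pwgtp15 = sample(1:500, n, replace = TRUE)
)

# Hypothetical helper: user CPU seconds spent calling f() `reps` times
user_seconds <- function(f, reps = 100) {
  timing <- system.time(for (i in seq_len(reps)) f())
  timing[["user.self"]]
}

times <- c(
  vector_subset = user_seconds(function() {
    mean(df$pwgtp15[df$SEX == 1])
    mean(df$pwgtp15[df$SEX == 2])
  }),
  tapply       = user_seconds(function() tapply(df$pwgtp15, df$SEX, mean)),
  sapply_split = user_seconds(function() sapply(split(df$pwgtp15, df$SEX), mean))
)

# Optional: add the data.table grouped mean when the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  times["data.table"] <- user_seconds(function() dt[, mean(pwgtp15), by = SEX])
}

# Sanity check: the compared methods must agree before their timings mean anything
by_subset <- c(mean(df$pwgtp15[df$SEX == 1]), mean(df$pwgtp15[df$SEX == 2]))
by_tapply <- as.vector(tapply(df$pwgtp15, df$SEX, mean))
stopifnot(isTRUE(all.equal(by_subset, by_tapply)))

print(times)
```

Timings vary by machine and run, so no expected output is shown; very fast expressions may even report 0 user seconds at this resolution.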

Excellent! Thank you so much! I have a better understanding now. The conclusion is that the question was not properly asked. Or, maybe the professor did it on purpose to make us work harder :) — Beginner, Feb 27 '20 at 20:27
