
I need to apply the Mann-Kendall trend test in R to a large number (about 1 million) of time series of different lengths. I've created a .txt file like this:

1 2
1 4
1 5
2 4
2 55
3 2
3 4
3 5
3 4
3 55
...

Each number in the first column identifies a particular time series. To obtain the statistics I'm using this code:

library(Kendall)

a <- read.table("to_r.txt")
numData <- 1017135

for (i in 1:numData) {
  s1 <- subset(a, a$V1 == i)
  m <- MannKendall(s1$V2)
  cat(m[[1]], "  ", m[[2]], "  ", m[[3]], "  ", m[[4]], "  ", m[[5]], "\n",
      file = "monotonic_trend_checking.txt", append = TRUE)
}

This approach works, but the problem is that the computation takes ages. The time series are at most 800 elements long (only a few reach that length; the others are shorter), so I think the main bottleneck is the loop. Can you suggest a faster approach?

papafe
  • Looping or not, reading a million text files will take a long time. See this other question for ways to speed that up. http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r/1728422 – Richie Cotton Dec 12 '11 at 11:11
  • Also, this looks like a very parallelisable problem, so now might be a good time to look at one of the map/reduce packages. http://cran.r-project.org/web/views/HighPerformanceComputing.html – Richie Cotton Dec 12 '11 at 11:20
  • General suggestions on speeding up R code: http://stackoverflow.com/a/8474941/636656 – Ari B. Friedman Dec 12 '11 at 13:18
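
Following the parallelisation suggestion in the comments above, here is a rough sketch (an assumption about one possible setup, not code from the question or the comments) of how the per-series calls could be spread over several cores with the parallel package, reusing the data frame a from the question:

library(parallel)
library(Kendall)

# one worker per core, leaving one free; cluster-based, so it also works on Windows
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(Kendall))   # load Kendall on every worker
mk <- parLapply(cl, split(a$V2, a$V1), MannKendall)
stopCluster(cl)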

3 Answers


I would definitely take a look at data.table if you think that looping is the problem here. I recently wrote a blog post which compares several ways of looping in terms of performance:

http://www.numbertheory.nl/2011/10/28/comparison-of-ave-ddply-and-data-table/

The main message is that data.table performs really well when the number of unique IDs (i.e. your number of time series) is very large.
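
For reference, here is a minimal sketch (not from the original answer) of what such a data.table call could look like, assuming the data have been read into the data frame a from the question, with V1 as the series id and V2 as the value:

library(Kendall)
library(data.table)

dt <- as.data.table(a)
# one grouped call computes all five MannKendall statistics per series
res <- dt[, as.list(unlist(MannKendall(V2))), by = V1]
# write everything once at the end instead of appending inside a loop
write.table(res, "monotonic_trend_checking.txt", row.names = FALSE)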

Paul Hiemstra

Here's a very simple solution that splits your dataset into pieces and then applies MannKendall to each piece.

# example dataset
d <- data.frame(a=rnorm(1e6), b=rep(1:1000, each=1000))

mk <- lapply(split(d$a, d$b), MannKendall)

This takes about ten seconds on my machine, which is a moderately powerful desktop running Windows 7. The result is a list; here's the first component:

> mk[[1]]
tau = 0.0319, 2-sided pvalue =0.13133

If you want it all in a data frame:

mk <- do.call(rbind, mk)
Hong Ooi
  • This can be done in even less code using ddply from the plyr package, which combines the do.call and the lapply in one step. Moreover, it provides support for parallel processing and a progress bar. – Paul Hiemstra Dec 12 '11 at 12:02
  • One thing to be wary of with plyr is that it doesn't cope well with large numbers of levels (because of its underlying reliance on data frames). While 1000 might still be okay, 10000 is into the no-go zone. So for simple tasks like this, I'd prefer to stick with base R. – Hong Ooi Dec 12 '11 at 12:10
  • I definitely agree, see also my reply to this question. For large numbers of levels I'd go for data.table, which is much faster even than the base R solution. – Paul Hiemstra Dec 12 '11 at 12:22
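
To illustrate the ddply approach discussed in these comments, here is a hedged sketch (not from the original posts), reusing the example data frame d from the answer above (column a holds the values, column b the series id):

library(plyr)
library(Kendall)

# one row of MannKendall statistics per series; the optional .progress and
# .parallel arguments give a progress bar and parallel execution
mk <- ddply(d, "b", function(x) data.frame(t(unlist(MannKendall(x$a)))))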

A big problem is that you're writing the results to a file for every single time series. File access takes time. It would be better to create a large data table with all of the results and then use write.table() once at the very end.

This, of course, assumes that you have enough RAM to hold the original data table and the new results table in memory, which probably won't be a huge problem.
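
As a rough sketch of this idea (assuming the data frame a and the output file name from the question), the cat() call inside the loop could be replaced by collecting the results first and writing them in a single call:

library(Kendall)

# one named vector of statistics (tau, sl, S, D, varS) per series
res <- lapply(split(a$V2, a$V1), function(x) unlist(MannKendall(x)))
res <- as.data.frame(do.call(rbind, res))
# single write at the end; the row names carry the series ids
write.table(res, "monotonic_trend_checking.txt")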