
I have some massive data sets that I must analyze and sort stepwise. Originally, I created a function like this:

big.data.frame <- read.csv(data)

my.function <- function(big.data.frame) {
  for (i in 1:100) {
    # sort/analyze big.data.frame & store into a new data frame: my.df
    if (i == 1) iteration.result <- my.df
    if (i > 1)  iteration.result <- rbind(my.df, iteration.result)
  }
  return(iteration.result)
}

Running this function with my data takes 90 minutes, which is absolutely horrible. My problem is that I am stuck in the second circle of hell, as described here: How can I prevent rbind() from getting really slow as the data frame grows larger?
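
To illustrate the problem for anyone unfamiliar with it: growing a data frame with rbind() copies every row accumulated so far on each iteration, so the total work grows quadratically with the number of rows. A tiny toy sketch (dummy data, not my real analysis):

grow_with_rbind <- function(n) {
  out <- data.frame()
  for (i in 1:n) {
    out <- rbind(out, data.frame(x = i))  # copies all previous rows every time
  }
  out
}
system.time(grow_with_rbind(2000))  # slows down sharply as n grows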

Now I have changed my code, following the recommendations of @joran in the link above and in The R Inferno:

big.data.frame <- read.csv(data)

my.function <- function(big.data.frame) {
  # pre-allocate a list with one slot per step
  my.list <- vector(mode = "list", 100)
  for (i in 1:100) {
    # sort/analyze big.data.frame & store into a new data frame: my.df
    my.list[[i]] <- my.df
  }
  result <- do.call(rbind, my.list)
  return(result)
}

But running this also took 90 minutes. I thought avoiding the incremental addition of rows would help, but it did not.
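
For reference, the data.table::rbindlist() variant suggested in the comments below looks like this; it did not change my timings either. (Note that rbindlist() returns a data.table rather than a data.frame.)

library(data.table)

# drop-in replacement for do.call(rbind, my.list)
result <- rbindlist(my.list)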

Note #1: there are 100 steps I need to take.

Note #2: My data are raw data sets stored in a data frame. I need to extract information, calculate, and remodel the whole of the original data frame, so the new data frame produced in each iteration (which I called my.df) looks quite different from the original data set.

Note #3: This is most of the output from summaryRprof():

$by.total
                         total.time total.pct self.time self.pct
"my.function      "         5010.42    100.00      0.14     0.00
"unlist"                    4603.32     91.87   4593.42    91.68
"function1"                 2751.96     54.92      0.04     0.00
"function2"                 2081.26     41.54      0.02     0.00
"["                          229.72      4.58      0.08     0.00
"[.data.frame"               229.66      4.58     17.60     0.35
"match"                      206.96      4.13    206.64     4.12
"%in%"                       206.52      4.12      0.28     0.01
"aggregate"                  182.76      3.65      0.00     0.00
"aggregate.data.frame"       182.74      3.65      0.04     0.00
"lapply"                     177.86      3.55     35.84     0.72
"FUN"                        177.82      3.55     68.06     1.36
"mean.default"                38.36      0.77     35.68     0.71
"unique"                      31.90      0.64      4.74     0.09
"sapply"                      26.34      0.53      0.02     0.00
"as.factor"                   25.86      0.52      0.02     0.00
"factor"                      25.84      0.52      0.24     0.00
"split"                       25.22      0.50      0.10     0.00
"split.default"               25.12      0.50      2.60     0.05
"as.character"                19.30      0.39     19.26     0.38
"aggregate.default"           14.40      0.29      0.00     0.00
"simplify2array"              12.94      0.26      0.02     0.00
"eval"                         5.94      0.12      0.10     0.00
"list"                         5.16      0.10      5.16     0.10
"NextMethod"                   4.04      0.08      4.04     0.08
"transform"                    4.04      0.08      0.00     0.00
"transform.data.frame"         4.04      0.08      0.00     0.00
"=="                           3.90      0.08      0.02     0.00
"Ops.factor"                   3.88      0.08      0.92     0.02
"sort.list"                    3.74      0.07      0.12     0.00
"[.factor"                     3.62      0.07      0.00     0.00
"match.arg"                    3.60      0.07      0.96     0.02
"ifelse"                       3.34      0.07      0.66     0.01
"levels"                       2.54      0.05      2.54     0.05
"noNA.levels"                  2.52      0.05      0.00     0.00
"data.frame"                   1.78      0.04      0.52     0.01
"is.numeric"                   1.54      0.03      1.54     0.03
"deparse"                      1.40      0.03      0.12     0.00
".deparseOpts"                 1.28      0.03      0.02     0.00
"formals"                      1.24      0.02      1.22     0.02
"as.data.frame"                1.24      0.02      0.00     0.00

Note #4: I see that the function "unlist" takes a lot of time, if I interpret things correctly. My function my.function actually takes a list as an argument:

my.function <- function(data, ...) {
  dots <- list(...)
  dots <- unlist(dots)

  # ... etc
}
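
If I read the profile correctly, nearly all of unlist's time is self time. One thing worth testing (my assumption, based on the fact that unlist() builds a names vector by default, which can be expensive on long lists) is to skip the name construction:

my.function <- function(data, ...) {
  dots <- list(...)
  # use.names = FALSE skips building the names attribute, which is
  # often the expensive part of unlist() on long inputs
  dots <- unlist(dots, use.names = FALSE)

  # ... etc
}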
  • Did you profile your function? This may show more explicitly where the bottleneck is. See `Rprof`. – coffeinjunky Jul 23 '14 at 08:49
  • Moreover, do the steps depend on each other? Do the results of iteration 3 depend on the results of iteration 2? – coffeinjunky Jul 23 '14 at 09:03
  • @coffeinjunky No, I have not tried Rprof. I will look into that, thank you. As for the steps, no, the steps are independent - they just create separate data frames. The "raw" data has a lot of variables and shows the development of the variables over time, so the data frames are dependent in a statistical sense, but I assume this is not what you mean. – Helen Jul 23 '14 at 09:13
  • 1
    @ErosRam see also this: [Concatenating a list of data frames](http://www.exegetic.biz/blog/2014/06/concatenating-a-list-of-data-frames/) – rcs Jul 23 '14 at 09:22
  • Did you try passing your big data frame as a data.table, writing your treatment function (whose output is a new data.table) as f, and finally rbindlist(lapply(1:100,f))? Any dummy but concrete big.data.frame example and some example operations you want to execute would be extremely useful. – Colonel Beauvel Jul 23 '14 at 09:27
  • 1
    1. Please profile your code. You think you are in the second circle of hell, but you don't actually know that. Other parts of your code may be slow. Use `Rprof(); your.function; Rprof(NULL)` and look at the result for `summaryRprof()` to find out which parts are slow. 2. If the iterations are independent, i.e. `embarassingly parallel`, you may want to run them in a parallel fashion. See e.g. `help(package="parallel)`. 3. Use `data.table` whenever possible. My gut feeling is that your read.csv and data manipulation are slow, not necessarily other things. `data.table` is your friend here. – coffeinjunky Jul 23 '14 at 09:29
  • 2
    I believe the problem is that you're not really pre-allocating much space with `my.list = vector(mode="list",100)`; it has no idea of the size of the 100 lists that you'll fill in later. Maybe you could try `my.list = replicate(100, largedf, simplify=FALSE)` where largedf is a dummy data.frame e.g. filled with NAs, of sufficient size to hold your data. – baptiste Jul 23 '14 at 09:34
  • also, `data.table::fread` is much faster than `read.csv` – baptiste Jul 23 '14 at 09:35
  • Thank you for all your replies. I will look into each one of them immediately and try to get back with an answer – Helen Jul 23 '14 at 09:37
  • Btw, the read.csv function is actually not inside the function; I just added it to show that I have used read.csv. All the different data sets I am working with are prepared beforehand. But you guys are correct, read.csv on my data is super slow! – Helen Jul 23 '14 at 09:44
  • I think your "sort/analyze big.data.frame & store into a new data frame: my.df" step may be the real culprit. If you are sorting integers, try `sort.int(,method='quick')`. Also make sure your sort/analysis step doesn't contain any loop that can be vectorized. If there is any subsetting, matching, or searching, convert everything to `data.table`. – Vlo Jul 23 '14 at 13:36
  • @coffeinjunky : Just finished the Rprof-ing. Took a lot of time. Updated the question above with the results, if you would be so kind and look at it. – Helen Jul 23 '14 at 13:39
  • @Vlo: I am sorting all kinds of stuff, really. I tried switching from do.call(rbind,my.list) to rbindlist(my.list), as suggested in the link from user rcs above, but it did not change anything. So maybe I am not in the second circle of hell after all, as coffeinjunky pointed out before. – Helen Jul 23 '14 at 13:44
  • Post your actual code. – Hong Ooi Jul 23 '14 at 14:41
  • Hong Ooi is right, I think seeing the actual code would help. Having said that, I think converting everything to a `data.table` would help as well, given that much time is spent on `data.frame` operations. Also, how big is your data, actually? – coffeinjunky Jul 23 '14 at 16:27
  • I shall post the actual code when I get back and check how big the data is exactly, because I left for a small vacation 1 hour ago. I shall try to use data.table, as well. Thank you very much for helping me. Is there anything that can be read from the summaryRprof() above? – Helen Jul 23 '14 at 18:23
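
Putting the comment suggestions together (data.table::fread, rbindlist, and the parallel package), this is a rough sketch of the pattern I plan to try; f and the dummy data below are placeholders, not my actual analysis:

library(data.table)
library(parallel)

# fread() is much faster than read.csv() for big files
# big.dt <- fread("data.csv")  # path is a placeholder

# f() stands in for one sort/analyze step; here it just returns a dummy
# one-row data.table so the pattern runs end to end
f <- function(i) data.table(step = i, value = i^2)

# the 100 steps are independent ("embarrassingly parallel"), so they can
# run across cores; rbindlist() binds the pieces together at the end
result <- rbindlist(mclapply(1:100, f))  # on Windows, use parLapply() instead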
