I have some massive data sets that I must analyze and sort stepwise. Previously, I wrote a function like this:
big.data.frame <- read.csv(data)

my.function <- function(big.data.frame) {
  for (i in 1:100) {
    # sort/analyze big.data.frame and store the result in a new data frame: my.df
    if (i == 1) iteration.result <- my.df
    if (i > 1)  iteration.result <- rbind(my.df, iteration.result)
  }
  return(iteration.result)
}
Running this function with my data takes 90 minutes, which is absolutely horrible. My problem is that I am stuck in the second circle of hell, as described here: How can I prevent rbind() from getting really slow as the data frame grows larger?
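To illustrate the pattern (with made-up sizes, not my real data): every pass through the loop rbind()s onto everything accumulated so far, so the whole result is copied again and again.

res <- data.frame()
system.time(
  for (i in 1:1000) {
    chunk <- data.frame(x = rnorm(100), y = runif(100))
    res <- rbind(res, chunk)   # copies all rows accumulated so far, every iteration
  }
)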
Now I have changed my code, following the recommendations from @joran in the link above and from The R Inferno:
big.data.frame <- read.csv(data)

my.function <- function(big.data.frame) {
  my.list <- vector(mode = "list", length = 100)
  for (i in 1:100) {
    # sort/analyze big.data.frame and store the result in a new data frame: my.df
    my.list[[i]] <- my.df
  }
  result <- do.call(rbind, my.list)
  return(result)
}
But running this also took 90 minutes. I thought avoiding the incremental addition of rows would help, but it made no difference.
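To check whether the row binding itself is the problem, I also timed just the combining step on a list of 100 dummy data frames (the sizes are a guess at my real ones, so treat this only as a sketch); if this finishes in seconds, the 90 minutes must be going into the sort/analyze step instead:

dummy.list <- lapply(1:100, function(i) data.frame(a = rnorm(1e4), b = runif(1e4)))
system.time(combined <- do.call(rbind, dummy.list))  # times only the combining, not the analysis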
Note #1: there are 100 steps I need to take.
Note #2: My data are raw data sets stored in a data frame. I need to extract information from, calculate on, and remodel the whole original data frame, so the new data frame produced in each iteration (which I called my.df) looks quite different from the original data set.
Note #3: This is most of the output from summaryRprof():
$by.total
total.time total.pct self.time self.pct
"my.function " 5010.42 100.00 0.14 0.00
"unlist" 4603.32 91.87 4593.42 91.68
"function1" 2751.96 54.92 0.04 0.00
"function2" 2081.26 41.54 0.02 0.00
"[" 229.72 4.58 0.08 0.00
"[.data.frame" 229.66 4.58 17.60 0.35
"match" 206.96 4.13 206.64 4.12
"%in%" 206.52 4.12 0.28 0.01
"aggregate" 182.76 3.65 0.00 0.00
"aggregate.data.frame" 182.74 3.65 0.04 0.00
"lapply" 177.86 3.55 35.84 0.72
"FUN" 177.82 3.55 68.06 1.36
"mean.default" 38.36 0.77 35.68 0.71
"unique" 31.90 0.64 4.74 0.09
"sapply" 26.34 0.53 0.02 0.00
"as.factor" 25.86 0.52 0.02 0.00
"factor" 25.84 0.52 0.24 0.00
"split" 25.22 0.50 0.10 0.00
"split.default" 25.12 0.50 2.60 0.05
"as.character" 19.30 0.39 19.26 0.38
"aggregate.default" 14.40 0.29 0.00 0.00
"simplify2array" 12.94 0.26 0.02 0.00
"eval" 5.94 0.12 0.10 0.00
"list" 5.16 0.10 5.16 0.10
"NextMethod" 4.04 0.08 4.04 0.08
"transform" 4.04 0.08 0.00 0.00
"transform.data.frame" 4.04 0.08 0.00 0.00
"==" 3.90 0.08 0.02 0.00
"Ops.factor" 3.88 0.08 0.92 0.02
"sort.list" 3.74 0.07 0.12 0.00
"[.factor" 3.62 0.07 0.00 0.00
"match.arg" 3.60 0.07 0.96 0.02
"ifelse" 3.34 0.07 0.66 0.01
"levels" 2.54 0.05 2.54 0.05
"noNA.levels" 2.52 0.05 0.00 0.00
"data.frame" 1.78 0.04 0.52 0.01
"is.numeric" 1.54 0.03 1.54 0.03
"deparse" 1.40 0.03 0.12 0.00
".deparseOpts" 1.28 0.03 0.02 0.00
"formals" 1.24 0.02 1.22 0.02
"as.data.frame" 1.24 0.02 0.00 0.00
Note #4: I see that the function "unlist" takes a lot of time, if I interpret things correctly. My function my.function actually collects its extra arguments into a list and unlists them:
my.function <- function(data, ...) {
  dots <- list(...)
  dots <- unlist(dots)
  # ... etc
}
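My current suspicion is the names handling in unlist: by default it builds a name for every element (use.names = TRUE), which I have read can dominate the runtime for long lists. Whether that applies here depends on what dots actually contains, so the following is only a self-contained check with invented data:

x <- replicate(1e5, c(a = 1, b = 2, c = 3), simplify = FALSE)  # 100,000 small named vectors
system.time(u1 <- unlist(x))                                   # default: builds 300,000 names
system.time(u2 <- unlist(x, use.names = FALSE))                # skips building names entirely
identical(unname(u1), u2)                                      # same values either way

If the second call is much faster on my real dots, switching to unlist(dots, use.names = FALSE) inside my.function looks like the obvious thing to try.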