I am trying to parse a huge dataset into R (1.3 GB). The original data is a list of four million character strings, each one an observation of 137 variables.
First I created a function that splits each string according to the key provided with the dataset, where d is one of those strings. For the purpose of this question, imagine that d has this form
"2005400d"
and the key would be
varName <- c("YEAR","AGE","GENDER","STATUS")
varIn <- c(1,5,7,8)
varEnd <- c(4,6,7,8)
where varIn and varEnd mark the start and end positions of each field. The function I created is:
parseLine <- function(d) {
  # Split the record into individual characters
  k <- unlist(strsplit(d, ""))
  vec <- rep(NA, length(varName))
  # Paste the characters of each field back together
  for (i in 1:length(varName)) {
    vec[i] <- paste(k[varIn[i]:varEnd[i]], sep = "", collapse = "")
  }
  return(vec)
}
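For example, applied to the record above with that key, the call should return one value per variable:

parseLine("2005400d")
# [1] "2005" "40"   "0"    "d"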
Then, in order to loop over all the available data, I created a for loop:
df <- data.frame(matrix(ncol = length(varName)))
names(df) <- as.character(varName)
# Parse each record and append it to the data frame row by row
for (i in 1:length(data)) {
  df <- rbind(df, parseLine(data[i]))
}
However, when I checked the loop with 1,000 iterations I got a system time of 10.82 seconds, but when I increased that to 10,000 iterations, instead of taking about 108.2 seconds it took 614.77 seconds, which indicates that the time needed grows much faster than linearly with the number of iterations.
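For reference, the timings were taken along these lines (a sketch: n is the number of records parsed and data is the full character vector):

n <- 1000  # then 10000
system.time({
  df <- data.frame(matrix(ncol = length(varName)))
  names(df) <- as.character(varName)
  for (i in 1:n) {
    df <- rbind(df, parseLine(data[i]))
  }
})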
Any suggestions for speeding up the process? I tried to use the foreach package, but it did not run in parallel as I expected:
library(foreach)
m <- foreach(i = 1:10, .combine = rbind) %dopar% parseLine(data[i])
df <- m
names(df) <- as.character(varName)
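I suspect the problem is that I never registered a parallel backend, so %dopar% falls back to running sequentially. If I understand correctly, something along these lines would be needed before the foreach call (the doParallel package and the core count are just assumptions on my part):

library(doParallel)
# Register a parallel backend; without one, %dopar% runs sequentially with a warning
registerDoParallel(cores = 2)  # core count chosen arbitrarily for this sketch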