
I am trying to parse a huge dataset (1.3 GB) into R. The original data is a list of four million character strings, each one an observation of 137 variables.

First I created a function that splits each character string according to the key provided in the dataset, where "d" is one of those strings. For the purpose of this question imagine that d has this form

"2005400d"

and the key would be

varName <- c("YEAR","AGE","GENDER","STATUS")
varIn   <- c(1,5,7,8)
varEnd  <- c(4,6,7,8)

where varIn and varEnd mark the splitting points. The function created was:

parseLine <- function(d) {
  k <- unlist(strsplit(d, ""))
  vec <- rep(NA, length(varName))
  for (i in seq_along(varName)) {
    vec[i] <- paste(k[varIn[i]:varEnd[i]], sep = "", collapse = "")
  }
  return(vec)
}
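As a quick sanity check, calling parseLine on the sample record above splits it into the four fields given by the key (self-contained sketch, repeating the key and function from above):

```r
# Key describing where each variable starts and ends in the record
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEnd  <- c(4, 6, 7, 8)

parseLine <- function(d) {
  k <- unlist(strsplit(d, ""))
  vec <- rep(NA, length(varName))
  for (i in seq_along(varName)) {
    vec[i] <- paste(k[varIn[i]:varEnd[i]], collapse = "")
  }
  vec
}

parseLine("2005400d")
# c("2005", "40", "0", "d")
```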

Then, in order to loop over all the available data, I created a for loop.

df <- data.frame(matrix(ncol = length(varName)))
names(df) <- as.character(varName)

for (i in seq_along(data)) {
  df <- rbind(df, parseLine(data[i]))
}

However, when I time the function with 1,000 iterations I get a system time of 10.82 seconds, but when I increase that to 10,000, instead of taking 108.2 seconds it takes 614.77, which indicates that as the number of iterations increases, the time needed grows much faster than linearly.

Any suggestions for speeding up the process? I have tried the foreach library, but it did not parallelize as I expected.

m <- foreach(i = 1:10, .combine = rbind) %dopar% parseLine(data[i])
df <- m
names(df) <- as.character(varName)
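For reference, a vectorized alternative along the lines alexis_laz suggests in the comments below: substring() accepts vectors of start and stop positions, so all fields of a record come out in one call, with no character-by-character loop. A sketch, assuming the same key as above (the second record is a made-up stand-in for the real data):

```r
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEnd  <- c(4, 6, 7, 8)

# One vectorized call extracts all four fields from a record
substring("2005400d", varIn, varEnd)
# c("2005", "40", "0", "d")

# Applied to the whole vector at once, building the result column by column
data <- c("2005400d", "1999231f")  # toy stand-in for the real four-million-element vector
cols <- lapply(seq_along(varName),
               function(i) substring(data, varIn[i], varEnd[i]))
df <- setNames(as.data.frame(cols, stringsAsFactors = FALSE), varName)
```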
joran
comendeiro
    In your first loop, are you -just- doing `substring("2005400d", varIn, varEND)`? If not, it seems that you could use something similar to that (which is faster, too) – alexis_laz Jul 06 '14 at 19:07
    Is your original data formatted with fixed width? Then [**this post**](http://stackoverflow.com/questions/18720036/reading-big-data-with-fixed-width) may be relevant. – Henrik Jul 06 '14 at 19:15
    don't keep `rbind`ing to the data frame; create a list of individual data frames, then `do.call(rbind,ListOfDataFrames)` – Ben Bolker Jul 06 '14 at 19:41
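A minimal sketch of the pattern Ben Bolker describes: parse every record into its own small data frame first, then bind them all in a single call, so the growing data frame is not re-copied on every iteration (the toy data and helper are illustrative):

```r
# Toy records and key, standing in for the real data
data    <- c("2005400d", "1999231f")
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
varIn   <- c(1, 5, 7, 8)
varEnd  <- c(4, 6, 7, 8)

# Parse each record into a one-row data frame...
rows <- lapply(data, function(d) {
  v <- substring(d, varIn, varEnd)
  setNames(as.data.frame(as.list(v), stringsAsFactors = FALSE), varName)
})

# ...and rbind them all exactly once
df <- do.call(rbind, rows)
```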

1 Answer


Why re-invent the wheel? Use read.fwf in the utils package (attached by default):

> dat <- "2005400d"
> varName <- c("YEAR","AGE","GENDER","STATUS")
> varIn   <- c(1,5,7,8)
> varEND  <- c(4,6,7,8)
> read.fwf(textConnection(dat), col.names=varName, widths=1+varEND-varIn)
  YEAR AGE GENDER STATUS
1 2005  40      0      d

You should get further efficiency if you specify colClasses, but my effort to demonstrate this failed to show a difference. Perhaps that advice only applies to read.table and its cousins.
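For reference, read.fwf forwards extra arguments on to read.table, so colClasses can be passed straight through. A sketch on the sample record; the classes chosen here are an assumption about the real data:

```r
dat     <- "2005400d"
varName <- c("YEAR", "AGE", "GENDER", "STATUS")
widths  <- c(4, 2, 1, 1)

# colClasses is handed through to the underlying read.table call,
# fixing each column's type instead of letting it be re-guessed
df <- read.fwf(textConnection(dat), widths = widths, col.names = varName,
               colClasses = c("integer", "integer", "integer", "character"))
df$YEAR  # stored as integer 2005
```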

IRTFM