
I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new dataset is fairly large, and the predict step runs out of memory if I score it all at once. So I'd like to convert the procedure below, which worked fine for small datasets, into a batch version that processes 500 lines at a time and writes out a file for each scored batch of 500.

I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:

trainingdata <- read.csv('in.csv', stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)

newdata <- read.csv('newstuff.csv', stringsAsFactors=F)
preds <- predict(fit, newdata)
write.csv(preds, file=filename)

to something like:

trainingdata <- read.csv('in.csv', stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)

con <- file("newstuff.csv", open = "r")
i <- 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
    i <- i + 1
    newdata <- as.data.frame(mylines, stringsAsFactors=F)
    preds <- predict(fit, newdata)
    write.csv(preds, file=paste(filename, i, '.csv', sep=''))
}
close(con)

However, when I print the mylines object inside the loop, it isn't parsed into columns the way read.csv output is: the header row is just another line of text, and nothing splits each line into fields and wraps the result into a data frame with the right number of columns.

Whenever I find myself writing barbaric things like cutting off the first row and wrapping the columns by hand, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines CSV connection?


1 Answer


You can read your data into memory in chunks with read.csv, using its skip and nrows arguments. In pseudo-code:

read_chunk = function(start, n) {
    read.csv(file, skip = start, nrows = n)
}

start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
    dat = read_chunk(x, chunk_size)
    pred = predict(fit, dat)
    write.csv(pred)
})

Alternatively, you could put the data into an SQLite database and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
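A rough, untested sketch of that route, with some assumptions: it uses DBI/RSQLite, invents the database file newstuff.sqlite, the table name newstuff, and the output file names, and reuses fit and the 500-row chunk size from the question:

library(DBI)
library(RSQLite)

db <- dbConnect(SQLite(), "newstuff.sqlite")
dbWriteTable(db, "newstuff", "newstuff.csv")   # RSQLite can import a CSV by file name; otherwise load it another way

chunk_size <- 500
offset <- 0
i <- 0
repeat {
    dat <- dbGetQuery(db, sprintf("SELECT * FROM newstuff LIMIT %d OFFSET %d",
                                  chunk_size, offset))
    if (nrow(dat) == 0) break                  # no rows left
    i <- i + 1
    preds <- predict(fit, dat)
    write.csv(preds, file = paste0("preds_", i, ".csv"))
    offset <- offset + chunk_size
}
dbDisconnect(db)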

Community
  • 1
  • 1
Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
  • Also, if I understand this correctly: the lapply is still making this simultaneous, rather than batchy, right? If my process runs out of memory due to the fitting step, I'll still have no output, rather than writing the first batch of n, the second batch of n, etc.? – Mittenchops Feb 27 '13 at 15:59
  • `lapply` does this in batches, each time `read_chunk` is called it is for a different `start` and `n` combination, so for a different subset of `file`. This does not preserve the header info inside `read_chunk`, but you could read the header once outside the `lapply` loop if you really need it. – Paul Hiemstra Feb 27 '13 at 16:02
  • The read is in batches this way, but the /write/ is not, correct? I'm attempting to read, predict, and write in batches. – Mittenchops Feb 27 '13 at 16:06
  • Thanks for your help, by the way. I'm sorry to ask so heartily for clarification. – Mittenchops Feb 27 '13 at 16:07
  • Each `write.csv` only writes what has been predicted in that step. – Paul Hiemstra Feb 27 '13 at 16:07
  • It's concatenating to the end of the last csv output file? – Mittenchops Feb 27 '13 at 16:08
  • Depends, setting append = TRUE ensures concatenation. – Paul Hiemstra Feb 27 '13 at 16:13
  • Please create a reproducible example of what you have, and what goes wrong. You do have to take care with the indices you feed read_chunk, I did not test the code. – Paul Hiemstra Feb 27 '13 at 16:33
Sorry, it wouldn't let me update the messy code, reposted here: Hmm, I like this idea, but when I run this, the variables get off-kilter somehow with this approach where they don't with read.csv(). item[2,2] is becoming item[2,3] after the first batch, etc. – Mittenchops Feb 27 '13 at 16:37
What seems to be happening is that the skip argument is, after the first call, starting in the middle of one of the text fields of the dataframe, rather than n rows down. I don't think skip is prepared to deal with the complications of text delimiters (but read.csv was fine taking in a small portion of the whole file at once when it was small enough). – Mittenchops Feb 27 '13 at 16:44
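Following up on the two issues raised in these comments (reading the header once outside the loop, and skip= tripping over quoted text fields), here is an untested sketch: it keeps the connection open so each read.csv call continues where the previous chunk ended and lets read.csv's own parser handle quoting. fit, newstuff.csv, and the 500-row chunk size come from the question; the output file names are placeholders.

con <- file("newstuff.csv", open = "r")
hdr <- names(read.csv(text = readLines(con, n = 1)))   # consume and parse the header line once

i <- 0
repeat {
    chunk <- tryCatch(
        read.csv(con, nrows = 500, header = FALSE, col.names = hdr,
                 stringsAsFactors = FALSE),
        error = function(e) NULL)              # read.csv errors once no lines remain
    if (is.null(chunk) || nrow(chunk) == 0) break
    i <- i + 1
    preds <- predict(fit, chunk)
    write.csv(preds, file = paste0("preds_", i, ".csv"))
}
close(con)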