
I've found a lot of answers for other types of data loading, but none for showing progress when R is reading data using read.table(...). I've got a simple command:

data = read.table(file=filename,
                  sep="\t",
                  col.names=c("time","id","x","y"),
                  colClasses=c("integer","NULL","NULL","NULL"))

This loads a large amount of data in about 30 seconds or so, but a progress bar would be really nice :-D

Hamy
  • I think it will be very difficult to do this without deep hacking of `read.table`. It's not even clear which component of `read.table` is the slow part ... I would `debug()` read.table [or use R profiling], figure out which component it was, and try to embed calls to `txtProgressBar` therein (hoping that the slow part did not drop down into C code ...) – Ben Bolker Jun 15 '11 at 19:55
  • It's not worth it if it's fairly complicated and gonna take more than ten minutes. I kind of assumed there was an easy way to have R 1) count the lines in a file, and 2) update a progress bar every time it read a line – Hamy Jun 15 '11 at 20:01
  • but maybe `scan` with `what=list(integer(), NULL, NULL, NULL)` is (a lot) faster? – Martin Morgan Jun 15 '11 at 20:03

1 Answer


Continuing experiments:

Construct a temporary working file:

n <- 1e7
dd <- data.frame(time=1:n,id=rep("a",n),x=1:n,y=1:n)
fn <- tempfile()
write.table(dd,file=fn,sep="\t",row.names=FALSE,col.names=FALSE)

Run 10 replications with read.table (with and without colClasses specified) and scan:

edit: corrected scan call in response to comment, updated results:

library(rbenchmark)
(b1 <- benchmark(read.table(fn,
                            col.names=c("time","id","x","y"),
                            colClasses=c("integer",
                                         "NULL","NULL","NULL")),
                 read.table(fn,
                            col.names=c("time","id","x","y")),
                 scan(fn,
                      what=list(integer(),NULL,NULL,NULL)),
                 replications=10))

Results:

                                                    test
2  read.table(fn, col.names = c("time", "id", "x", "y"))
1  read.table(fn, col.names = c("time", "id", "x", "y"),
       colClasses = c("integer", "NULL", "NULL", "NULL"))
3     scan(fn, what = list(integer(), NULL, NULL, NULL))

  replications elapsed relative user.self sys.self
2           10 278.064 1.857016   232.786   30.722
1           10 149.737 1.011801   141.365    2.388
3           10 143.118 1.000000   140.617    2.105

(warning, these values are slightly cooked/inconsistent because I re-ran the benchmark and merged the results ... but the qualitative result should be OK).

read.table without colClasses is slowest (that's not surprising), but only (?) about 86% slower than scan for this example. scan is only a tiny bit faster than read.table with colClasses specified.

With either scan or read.table one could write a "chunked" version that used the skip and nrows (read.table) or n (scan) parameters to read bits of the file at a time, then combine the pieces (e.g. with rbind) at the end. I don't know how much this would slow the process down, but it would allow calls to txtProgressBar between chunks ...
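For concreteness, here is a minimal, untested sketch of the read.table flavor of that idea. It assumes a header-less tab-separated file fn as in the benchmark above; the chunk size (1e6) is arbitrary, and the readLines pass to get the line count costs an extra read of the whole file. Note also that skip re-scans from the start of the file on every call, so this will get slower on later chunks:

```r
## Sketch only: chunked read.table with a progress bar between chunks.
## Assumes fn is tab-separated with no header, as in the benchmark above.
n_lines <- length(readLines(fn))   # crude line count (reads the file once)
chunk   <- 1e6                     # arbitrary chunk size
starts  <- seq(0, n_lines - 1, by = chunk)
pb <- txtProgressBar(min = 0, max = n_lines, style = 3)
pieces <- vector("list", length(starts))
for (i in seq_along(starts)) {
  pieces[[i]] <- read.table(fn, sep = "\t",
                            col.names = c("time", "id", "x", "y"),
                            colClasses = c("integer", "NULL", "NULL", "NULL"),
                            skip = starts[i],
                            nrows = min(chunk, n_lines - starts[i]))
  setTxtProgressBar(pb, starts[i] + nrow(pieces[[i]]))
}
close(pb)
data <- do.call(rbind, pieces)
```

The connection-based loop in the comments below avoids the repeated re-scanning that skip causes here.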

Ben Bolker
  • `scan` wants `list(integer(), NULL, NULL, NULL)` rather than quoted characters (it'll read in four character vectors otherwise). – Martin Morgan Jun 15 '11 at 21:37
  • Actually, the chunked version is `fl = file(fn, "r"); while (length(r <- scan(fl, what, nmax)[[1]])) { }` with e.g., nmax=400000. Put the progress bar in `{}`. – Martin Morgan Jun 15 '11 at 22:36
  • PS. I spent longer than I should have putting this sort of chunking around the `scan` call in a hacked copy of `read.table`. Then I realized that there's a problem adding the progress bar, which is that we don't know in advance how long the file is! This could be figured out fairly quickly in a Unix-like environment with a quick call to `wc`, but it all starts to seem a bit much ... – Ben Bolker Jun 16 '11 at 15:02
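A hedged expansion of the connection-based loop from the comments: instead of counting lines in advance (the problem noted in the last comment), this sidesteps it by tracking the byte offset with seek() against the total file size from file.info(). The chunk size nmax is arbitrary, and the file is again assumed to be tab-separated with no header:

```r
## Sketch only: scan() from an open connection, chunk by chunk,
## with byte-based progress so the line count isn't needed up front.
what  <- list(integer(), NULL, NULL, NULL)
nmax  <- 4e5                           # records per chunk (arbitrary)
total <- file.info(fn)$size            # total bytes, known in advance
fl <- file(fn, "r")
pb <- txtProgressBar(min = 0, max = total, style = 3)
res <- list()
while (length(r <- scan(fl, what = what, nmax = nmax,
                        sep = "\t", quiet = TRUE)[[1]])) {
  res[[length(res) + 1]] <- r
  setTxtProgressBar(pb, seek(fl))      # current byte offset in the file
}
close(pb); close(fl)
time <- unlist(res)                    # the single column we kept
```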