26

Is it possible to create a progress bar for data loaded into R using load()?

For a data analysis project, large matrices are being loaded in R from .RData files, which take several minutes to load. I would like to have a progress bar to monitor how much longer it will be before the data is loaded. R already has nice progress bar functionality integrated (e.g. `txtProgressBar`), but load() has no hooks for monitoring how much data has been read. If I can't use load() directly, is there an indirect way I can create such a progress bar? Perhaps loading the .RData file in chunks and putting them together for R. Does anyone have any thoughts or suggestions on this?
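For reference, here is a minimal illustration of the built-in progress bar I have in mind (the `Sys.sleep` call is just a stand-in for a unit of work):

pb <- txtProgressBar(min = 0, max = 100, style = 3)
for (i in 1:100) {
    Sys.sleep(0.01)          # stand-in for a unit of work
    setTxtProgressBar(pb, i)
}
close(pb)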

Nixuz
  • I don't know how you do a progress bar, but have you considered at least displaying a timer? I find that a running timer makes the wait go by quicker, and then I know the program is still responding. You could display a message like `You've been waiting 1:32 and the wait is normally ~3 minutes. Grab a coffee!` (see the timer sketch after these comments) – Tommy O'Dell May 29 '11 at 05:00
  • Two previous questions: http://stackoverflow.com/questions/5423760/how-do-you-create-a-progress-bar-when-using-the-foreach-function-in-r/6170107#6170107 and http://stackoverflow.com/q/3820402/583830 suggest `txtProgressBar` and `gtkProgressBar`. The latter is from the RGtk2 package. Are these what you are looking for? – jthetzel May 29 '11 at 20:36
  • Sorry, I missed that you already know of the `txtProgressBar` function and that your question is actually about loading .Rdata files. – jthetzel May 29 '11 at 20:40
  • load has no hooks for progress bars *yet* - R is open source so you can add them with a bit of programming... – Spacedman Jun 10 '11 at 07:02
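
A rough sketch of the timer idea from the comments (the `timedLoad` helper and its messages are hypothetical, not part of base R). Since `load()` blocks the R process, a live ticking clock isn't straightforward, but you can at least print the expected wait up front and the actual elapsed time afterwards:

timedLoad <- function(file, expected = "~3 minutes") {
    # Announce the typical wait before the blocking load() call
    message(sprintf("Loading %s (typical wait: %s). Grab a coffee!", file, expected))
    t0 <- proc.time()
    load(file, envir = parent.frame())
    # Report the actual elapsed wall-clock time afterwards
    elapsed <- (proc.time() - t0)[["elapsed"]]
    message(sprintf("Finished in %.0f seconds.", elapsed))
}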

2 Answers

13

I came up with the following solution, which will work for file sizes less than 2^31 - 1 bytes (the maximum length of a raw vector in R).

The R object needs to be serialized and saved to a file, as done by the following code.

saveObj <- function(object, file.name){
    # Open a binary connection and write the object's serialization to it
    outfile <- file(file.name, "wb")
    serialize(object, outfile)
    close(outfile)
}

Then we read the binary data in chunks, keeping track of how much is read and updating the progress bar accordingly.

loadObj <- function(file.name){
    library(foreach)
    library(iterators)  # provides icount()
    filesize <- file.info(file.name)$size
    chunksize <- ceiling(filesize / 100)
    pb <- txtProgressBar(min = 0, max = 100, style = 3)
    infile <- file(file.name, "rb")
    # Read the file in 100 chunks, updating the progress bar after each;
    # .combine = c concatenates the raw chunks back into a single vector
    data <- foreach(it = icount(100), .combine = c) %do% {
        setTxtProgressBar(pb, it)
        readBin(infile, "raw", chunksize)
    }
    close(infile)
    close(pb)
    return(unserialize(data))
}

The code can be run as follows:

> a <- 1:100000000
> saveObj(a, "temp.RData")
> b <- loadObj("temp.RData")
  |======================================================================| 100%
> all.equal(b, a)
[1] TRUE

If we benchmark the progress bar method against reading the file in a single chunk, we see that it is slightly slower, but not enough to worry about.

> infile <- file("temp.RData", "rb")
> system.time(unserialize(readBin(infile, "raw", file.info("temp.RData")$size)))
   user  system elapsed
  2.710   0.340   3.062
> system.time(b <- loadObj("temp.RData"))
  |======================================================================| 100%
   user  system elapsed
  3.750   0.400   4.154

So while the above method works, I feel it is of little practical use because of the file size restriction: progress bars only pay off for large files that take a long time to read in, and those are exactly the files most likely to exceed the limit.

It would be great if someone could come up with something better than this solution!

Nixuz
3

Might I instead suggest speeding up the load (and save) times so that a progress bar isn't needed? If reading one matrix is "fast", you could then potentially report progress between each matrix read (if you have many); a sketch of that follows.
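
As a rough sketch of that idea (it assumes many files with one matrix per file, and uses the `loadMatrix` helper defined further down), progress can be reported once per file:

loadAll <- function(files) {
    pb <- txtProgressBar(min = 0, max = length(files), style = 3)
    result <- vector("list", length(files))
    for (i in seq_along(files)) {
        result[[i]] <- loadMatrix(files[i])  # one fast read per matrix
        setTxtProgressBar(pb, i)             # progress advances per file
    }
    close(pb)
    result
}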

Here are some measurements. By simply setting compress=FALSE, the load speed is doubled. But by writing a simple matrix serializer, the load speed is almost 20x faster.

x <- matrix(runif(1e7), 1e5) # Matrix with 100k rows and 100 columns

system.time( save('x', file='c:/foo.bin') ) # 13.26 seconds
system.time( load(file='c:/foo.bin') ) # 2.03 seconds

system.time( save('x', file='c:/foo.bin', compress=FALSE) ) # 0.86 seconds
system.time( load(file='c:/foo.bin') ) # 0.92 seconds

system.time( saveMatrix(x, 'c:/foo.bin') ) # 0.70 seconds
system.time( y <- loadMatrix('c:/foo.bin') ) # 0.11 seconds !!!
identical(x,y)

Where saveMatrix/loadMatrix are defined as follows. They don't currently handle dimnames and other attributes, but that could easily be added (see the sketch after the code).

saveMatrix <- function(m, fileName) {
    con <- file(fileName, 'wb')
    on.exit(close(con))
    writeBin(dim(m), con)     # two integers: nrow, ncol
    writeBin(typeof(m), con)  # element type, e.g. "double"
    writeBin(c(m), con)       # the data, with the dim attribute dropped
}

loadMatrix <- function(fileName) {
    con <- file(fileName, 'rb')
    on.exit(close(con))
    d <- readBin(con, 'integer', 2)
    type <- readBin(con, 'character', 1)
    # Read prod(d) elements of the stored type and restore the dimensions
    structure(readBin(con, type, prod(d)), dim = d)
}
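
As a sketch of how dimnames support could be added (`saveMatrix2`/`loadMatrix2` are hypothetical extensions, untested): `writeBin` can't write a list directly, so one option is to serialize the dimnames to a raw vector and store it with a length prefix.

saveMatrix2 <- function(m, fileName) {
    con <- file(fileName, 'wb')
    on.exit(close(con))
    writeBin(dim(m), con)
    writeBin(typeof(m), con)
    dn <- serialize(dimnames(m), NULL)  # raw vector; NULL dimnames are fine
    writeBin(length(dn), con)           # length prefix so it can be read back
    writeBin(dn, con)
    writeBin(c(m), con)
}

loadMatrix2 <- function(fileName) {
    con <- file(fileName, 'rb')
    on.exit(close(con))
    d <- readBin(con, 'integer', 2)
    type <- readBin(con, 'character', 1)
    n <- readBin(con, 'integer', 1)
    dn <- unserialize(readBin(con, 'raw', n))
    structure(readBin(con, type, prod(d)), dim = d, dimnames = dn)
}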
Tommy
  • We only need to load a single matrix which is several (3+) gigs in size. I considered breaking up the matrix into several parts and then reading them in separately, but that is an ugly solution and not worth the complexity just for a progress bar. – Nixuz Jun 18 '11 at 23:51
  • ...so then the loadMatrix above should speed things up considerably... Did you try it? – Tommy Jun 22 '11 at 17:33
  • The [bigmemory library](http://cran.r-project.org/web/packages/bigmemory/index.html), for saving and loading very large numeric matrices, would be a good and fast solution too. – Nixuz Feb 27 '13 at 04:25