
I have a large (450MB / 250 million rows) flat file of 1s and 0s that looks like this...

    1
    0
    0
    1
    0
    1
    0
    etc...

I am using the following method to read it into R...

    dat <- as.numeric(readLines("my_large_file"))

I am getting the desired data structure but it takes a long time. Any suggestions for a quicker method to achieve the same result?

NB: The order of the 1s and 0s is important to preserve. I would consider options in either python or the unix command line, but the final data structure is required in R for plotting a graph.

Tom Smith

2 Answers


For numeric files where you just want a vector returned, you might do better with `scan`.

    scan("my_large_file", what = integer())

Supplying the `what` argument speeds up reading even further (as opposed to leaving it out), since you are effectively telling R in advance that it will be reading integer values. `scan` also has many other arguments that come in handy with large numeric files (e.g. `skip`, `nlines`, etc.), as the sketch below shows.
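
For example, a minimal sketch of reading just one slice of the file (the particular line range here is only for illustration):

    # skip the first million lines, then read the next million;
    # skip and nlines are standard scan() arguments
    chunk <- scan("my_large_file", what = integer(), skip = 1e6, nlines = 1e6)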

In addition, as mentioned by @baptiste in the comments,

    library(data.table)
    fread("my_large_file")

blows both `readLines` and `scan` away (on my machine).
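
Note that `fread` returns a data.table rather than a plain vector. Assuming the lone, unnamed column gets the default name `V1` (as it does in the timings below), you can extract it like so:

    library(data.table)
    # fread auto-names the single unnamed column V1;
    # $V1 pulls it out as an ordinary integer vector
    dat <- fread("my_large_file")$V1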

NOTE: Probably a typo, but in your original post, I think `readlines` should be `readLines`.

Rich Scriven

Timings comparing a couple of options. First, some data.

    set.seed(21)
    x <- sample.int(2, 25e6, TRUE) - 1L
    writeLines(as.character(x), "data")

Now, some benchmarks (each run from a new R session to avoid the file being cached).

    > system.time(r <- as.numeric(readLines("data")))
       user  system elapsed 
      5.235   0.447   5.681 
    > system.time(r <- scan("data", what = numeric()))
    Read 25000000 items
       user  system elapsed 
      4.199   0.286   4.483 
    > system.time(r <- scan("data", what = integer()))
    Read 25000000 items
       user  system elapsed 
      3.134   0.081   3.214 
    > require(data.table)
    > system.time(r <- fread("data")$V1)
       user  system elapsed 
      0.412   0.026   0.439 

And verification:

    > num <- as.numeric(readLines("data"))
    > int <- as.integer(readLines("data"))
    > sn <- scan("data", what = numeric())
    Read 25000000 items
    > si <- scan("data", what = integer())
    Read 25000000 items
    > dti <- fread("data")$V1
    > identical(num, sn)
    [1] TRUE
    > identical(int, si)
    [1] TRUE
    > identical(int, dti)
    [1] TRUE
Joshua Ulrich
  • Thanks, I just learned like three new things just from this answer. – Rich Scriven Jul 21 '14 at 19:29
  • @RichardScriven: I hope they were not only new, but *useful*. ;) – Joshua Ulrich Jul 21 '14 at 19:29
  • I find it depressing that a computer would take "a while" to generate a text file with a few million 0s and 1s. – baptiste Jul 21 '14 at 19:33
  • @baptiste: `write` is a wrapper to `cat`, which doesn't seem to be optimized for this sort of thing. `writeLines` is faster. – Joshua Ulrich Jul 21 '14 at 19:45
  • something else doesn't seem to be optimised for this: I've been unable to use my computer for the past 10 minutes, it froze when I tried to produce a black-and-white raster image with that number of pixels. Bad idea, next time I'll use my camera and shoot at the sky [for randomness](http://www.random.org/randomness/). – baptiste Jul 21 '14 at 20:02