
I have a large (450MB / 250 million rows) flat file of 1s and 0s that looks like this...

    1
    0
    0
    1
    0
    1
    0
    etc...

I am using the following method to read it into R...

    dat <- as.numeric(readLines("my_large_file"))

I am getting the desired data structure but it takes a long time. Any suggestions for a quicker method to achieve the same result?

NB: The order of the 1s and 0s is important to preserve. I would consider options in either python or the unix command line, but the final data structure is required in R for plotting a graph.

Tom Smith

2 Answers


For numeric files where you just want a vector returned, you might do better with `scan`.

    scan("my_large_file", what = integer())

Supplying the `what` argument speeds up reading even further (as opposed to leaving it out), since you are effectively telling R in advance that it will be reading integer values. `scan` also has many other arguments that come in handy with large numeric files (e.g. `skip`, `nlines`, etc.), as the sketch below shows.
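
For example, a minimal sketch of reading just one slice of the file (the particular line range here is only for illustration):

    # skip the first million lines, then read the next million;
    # skip and nlines are standard scan() arguments
    chunk <- scan("my_large_file", what = integer(), skip = 1e6, nlines = 1e6)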

In addition, as mentioned by @baptiste in the comments,

    library(data.table)
    fread("my_large_file")

blows both `readLines` and `scan` away (on my machine).
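
Note that `fread` returns a data.table rather than a plain vector. Assuming the lone, unnamed column gets the default name `V1` (as it does in the timings below), you can extract it like so:

    library(data.table)
    # fread auto-names the single unnamed column V1;
    # $V1 pulls it out as an ordinary integer vector
    dat <- fread("my_large_file")$V1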

NOTE: Probably a typo, but in your original post, I think `readlines` should be `readLines`.

Rich Scriven

Timings comparing a couple of options. First, some data.

    set.seed(21)
    x <- sample.int(2, 25e6, TRUE) - 1L
    writeLines(as.character(x), "data")

Now, some benchmarks (each run from a new R session to avoid the file being cached).

    > system.time(r <- as.numeric(readLines("data")))
       user  system elapsed 
      5.235   0.447   5.681 
    > system.time(r <- scan("data", what = numeric()))
    Read 25000000 items
       user  system elapsed 
      4.199   0.286   4.483 
    > system.time(r <- scan("data", what = integer()))
    Read 25000000 items
       user  system elapsed 
      3.134   0.081   3.214 
    > require(data.table)
    > system.time(r <- fread("data")$V1)
       user  system elapsed 
      0.412   0.026   0.439 

And verification:

    > num <- as.numeric(readLines("data"))
    > int <- as.integer(readLines("data"))
    > sn <- scan("data", what = numeric())
    Read 25000000 items
    > si <- scan("data", what = integer())
    Read 25000000 items
    > dti <- fread("data")$V1
    > identical(num, sn)
    [1] TRUE
    > identical(int, si)
    [1] TRUE
    > identical(int, dti)
    [1] TRUE
Joshua Ulrich
  • Thanks, I just learned like three new things just from this answer. – Rich Scriven Jul 21 '14 at 19:29
  • @RichardScriven: I hope they were not only new, but *useful*. ;) – Joshua Ulrich Jul 21 '14 at 19:29
  • I find it depressing that a computer would take "a while" to generate a text file with a few million 0s and 1s. – baptiste Jul 21 '14 at 19:33
  • @baptiste: `write` is a wrapper to `cat`, which doesn't seem to be optimized for this sort of thing. `writeLines` is faster. – Joshua Ulrich Jul 21 '14 at 19:45
  • something else doesn't seem to be optimised for this: I've been unable to use my computer for the past 10 minutes, it froze when I tried to produce a black-and-white raster image with that number of pixels. Bad idea, next time I'll use my camera and shoot at the sky [for randomness](http://www.random.org/randomness/). – baptiste Jul 21 '14 at 20:02