I have a few large data files I'd like to sample when loading into R. I can load the entire data set, but it's really too large to work with. sample does roughly the right thing, but I'd like to take random samples of the input while reading it.

I can imagine how to build that with a loop and readLines and what-not, but surely this has been done hundreds of times.
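Something like this is what I have in mind (a rough sketch of reservoir sampling in base R; the function name and parameters are placeholders, not from any package):

```r
# Reservoir sampling (Algorithm R): keep a uniform random sample of k
# lines from a connection without ever holding the whole file in memory.
sample_lines <- function(con, k = 1000, chunk = 10000) {
    reservoir <- character(0)
    n <- 0  # total lines seen so far
    repeat {
        lines <- readLines(con, n = chunk)
        if (length(lines) == 0) break
        for (line in lines) {
            n <- n + 1
            if (n <= k) {
                # Fill the reservoir with the first k lines.
                reservoir[n] <- line
            } else {
                # Replace a random slot with probability k/n.
                j <- sample.int(n, 1)
                if (j <= k) reservoir[j] <- line
            }
        }
    }
    reservoir
}
```

Used with, say, a bzfile connection: open it, pass it in, close it afterwards.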

Is there something in CRAN or even base that can do this?

Dustin
  • See [here](http://r.789695.n4.nabble.com/Efficiently-reading-random-lines-form-a-large-file-td825269.html) for some ideas. – joran Aug 27 '11 at 02:50

3 Answers

You can do that in one line of code using sqldf. See part 6e of example 6 on the sqldf home page.
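For reference, that example amounts to roughly the following (paraphrased from memory of the sqldf page, not copied from it; the demo file here is made up):

```r
library(sqldf)

# Demo file standing in for the real data (the thread's file isn't shown).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000, y = rnorm(1000)), tmp, row.names = FALSE)

# SQLite reads the file outside of R's memory, shuffles rows with
# random(), and limit keeps only the sample -- R never sees the full table.
DF <- read.csv.sql(tmp, sql = "select * from file order by random() limit 100")
```

In read.csv.sql the table built from the file is referred to as file in the SQL statement.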

G. Grothendieck
  • sqldf looks pretty good, but not exactly what I was looking for. I think it might be the right thing in the long term, though. – Dustin Sep 01 '11 at 23:24
  • Should the example be added into the answer to make it so that it's not basically a link only answer? – Dason Jan 07 '15 at 16:43
No pre-built facilities. Best approach would be to use a database management program. (Seems as though this was addressed in either SO or Rhelp in the last week.)

Take a look at: Read csv from specific row, and especially note Grothendieck's comments. I consider him a "class A wizaRd". He's got first-hand experience with sqldf. (The author, IIRC.)

And another "huge files" problem with a Grothendieck solution that succeeded: R: how to rbind two huge data-frames without running out of memory

IRTFM
I wrote the following function that does close to what I want:

readBigBz2 <- function(fn, sample_size=1000) {
    f <- bzfile(fn, "r")
    rv <- character(0)
    repeat {
        # Read the next chunk of sample_size lines and keep one of them
        # at random, i.e. sample roughly 1/sample_size of the file.
        lines <- readLines(f, sample_size)
        if (length(lines) == 0) break
        rv <- append(rv, sample(lines, 1))
    }
    close(f)
    rv
}

I may want to go with sqldf in the long-term, but this is a pretty efficient way of sampling the file itself. I just don't quite know how to wrap that around a connection for read.csv or similar.
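One possibility for that last step, sketched here with made-up data: textConnection turns a character vector back into a connection, so the sampled lines can go straight into read.csv with no temp file, as long as the header line is carried along (or re-attached) by hand.

```r
# Sketch: parse a character vector of sampled CSV lines with read.csv.
# "sampled" stands in for the output of a line sampler, with the file's
# header line prepended so the column names survive the sampling.
sampled <- c("x,y", "1,10", "3,30")
df <- read.csv(textConnection(sampled))
```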

Dustin