
I have a massive (8 GB) dataset that I am simply unable to read into R with my existing setup. Attempting to use fread on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing the total number of rows in the dataset; (2) my method was not a true "random sampling."

These attempts to get the number of rows have failed (each takes as long as simply reading the data in):

  1. length(count.fields("file.dat", sep = "|"))
  2. read.csv.sql("file.dat", header = FALSE, sep = "|", sql = "select count(*) from file")

Is there any way via R or some other program to generate a random sample from a large underlying dataset?

Potential idea: Given a "sample" of the first several rows, is it possible to estimate the average amount of data per row, and then back out how many rows there must be from the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ballpark figure that I could just undercut.
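For what it's worth, a minimal sketch of that ballpark idea (the file name and the 5,000-line sample size are illustrative; with Windows line endings each row may take one extra byte):

# estimate average bytes per row from the first few thousand lines,
# then divide the total file size by that average
head_lines = readLines("file.dat", n = 5000)
bytes_per_row = mean(nchar(head_lines, type = "bytes")) + 1  # +1 for the newline
est_rows = floor(file.size("file.dat") / bytes_per_row)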

Parseltongue
  • Put it in a SQLite db and grab a random set of rows, either directly via RSQLite or via dplyr (a minimal RSQLite sketch follows these comments). – joran Dec 08 '17 at 16:02
  • There are methods like [reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling), but I don't know of any implementations in common R packages (a base-R sketch follows these comments). – MrFlick Dec 08 '17 at 16:02
  • Read the first column only to get the number of rows (using fread's `select` argument), then read the file in chunks using the `skip` and `nrows` arguments of `fread`. I haven't tested this, but it might work, although it will probably be very slow. Every time you get a chunk, take a sample from it and `rm` it. – LyzandeR Dec 08 '17 at 16:30
  • [This post](https://stackoverflow.com/questions/39691133/r-data-table-counting-rows-with-fread-without-reading-the-whole-file) contains information on obtaining a row count without reading the file. If you obtain the row count, you might be able to randomly sample from 1 to that row count and read in only that subset. – Nick Criswell Dec 08 '17 at 16:42
  • I appreciate that post, Nick, but how do you recommend randomly sampling from a dataset? LyzandeR's method (at least my implementation) is far too slow for my purposes. – Parseltongue Dec 08 '17 at 16:47
  • Did you check out the [LaF](https://cran.r-project.org/web/packages/LaF/index.html) package? It is rather fast at reading in chunks of rows. See a related SO answer [here](https://stackoverflow.com/questions/19894194/reading-in-chunks-at-a-time-using-fread-in-package-data-table/27535019#27535019). – user3375672 Dec 09 '17 at 05:42
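For reference, a minimal sketch of joran's SQLite route (untested; the database and table names are illustrative). The one-time import is slow, but the data then lives on disk, so repeated random samples never pull the full 8 GB into memory:

library(RSQLite)
con = dbConnect(SQLite(), "file.db")
# one-time import; RSQLite's dbWriteTable can read directly from a file
dbWriteTable(con, "dat", "file.dat", sep = "|", header = FALSE)
# ORDER BY RANDOM() scans the whole table, but on disk rather than in RAM
smp = dbGetQuery(con, "SELECT * FROM dat ORDER BY RANDOM() LIMIT 1000")
dbDisconnect(con)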
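And a base-R sketch of the reservoir sampling MrFlick mentions (the function name is hypothetical, and sep = "|" assumes the question's pipe-separated file). The file is read once through a connection, and each incoming line replaces a slot in the reservoir with probability k/n, so memory use stays at k lines no matter how big the file is:

reservoir_sample = function(path, k, chunk = 10000) {
  con = file(path, open = "r")
  on.exit(close(con))
  header = readLines(con, n = 1)    # keep the header out of the sample
  reservoir = character(k)
  n = 0                             # lines seen so far
  repeat {
    lines = readLines(con, n = chunk)
    if (length(lines) == 0) break
    for (line in lines) {
      n = n + 1
      if (n <= k) {
        reservoir[n] = line
      } else {
        j = sample.int(n, 1)        # this line survives with probability k/n
        if (j <= k) reservoir[j] = line
      }
    }
  }
  if (n < k) reservoir = reservoir[seq_len(n)]  # file had fewer than k rows
  data.table::fread(text = c(header, reservoir), sep = "|")
}

smp = reservoir_sample("file.dat", 1000)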

1 Answer


Here's one option, using the ability of fread to accept a shell command that preprocesses the file as its input. With this approach we can run a gawk script to extract the required lines. Note that you may need to install gawk if it is not already on your system; if you have awk instead, you can use that in its place.

First, let's create a dummy file to test on:

library(data.table)
dt = data.table(1:1e6, sample(letters, 1e6, replace = TRUE))
write.csv(dt, 'test.csv', row.names = FALSE)

Now we can use the shell command wc to find how many lines there are in the file:

nl = read.table(pipe("wc -l test.csv"))[[1]]

Take a sample of line numbers and write them (in ascending order) to a temp file, which makes them easily accessible to gawk.

N = 20 # number of lines to sample
sample.lines = sort(sample(2:nl, N))  # start sample at line 2 to exclude the header
cat(paste0(sample.lines, collapse = '\n'), file = "lines.txt")

Now we are ready to read in the sample using fread and gawk (based on this answer). You can also try some of the other gawk scripts in the linked question, which could possibly be more efficient on very large data.

# NR == FNR is true only while gawk reads the first file (lines.txt), so this
# collects the sampled line numbers into the array nums; for the second file
# (test.csv) it prints each row whose line number (FNR) is in nums
dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv")
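One caveat: the gawk output has no header row (the sample deliberately starts at line 2), so fread has to guess whether the first sampled row is a header. To be explicit, you could pass header = FALSE and supply the column names yourself (here taken from the dummy data created above):

dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv",
                  header = FALSE, col.names = names(dt))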
dww
  • Thanks for your answer. The line `nl = read.table(pipe("wc -l test.csv"))[[1]]` I suspect suffers from the same problems as in my question. I have had my R console running for 20 minutes, and the line has yet to terminate. I'm going to try the gawk idea with a ballpark estimate of the rows and get back to you. – Parseltongue Dec 08 '17 at 19:12