I have a massive (8GB) dataset, which I am simply unable to read into R using my existing setup. Attempting to use fread
on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing that total number of rows in the dataset; (2) my method was not a true "random sampling."
These attempts to get the number of rows have failed (they take as long as simply reading the data in:
length(count.fields("file.dat", sep = "|"))
read.csv.sql("file.dat", header = FALSE, sep = "|", sql = "select count(*) from file")
Is there any way via R or some other program to generate a random sample from a large underlying dataset?
Potential idea: Is it possible, given a "sample" of the first several rows to get a sense of the average amount of information contained on a per-row basis. And then back-out how many rows there must be given the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ball-park figure that I could just under-cut.