
I'm doing some analysis in R where I need to work with some large datasets (10-20 GB, stored as .csv files and read with the read.csv function).

Since I will also need to merge and transform these large .csv files with other data frames, I don't have the computing power or memory to import an entire file.

I was wondering if anyone knows of a way to import a random percentage of the csv.

I have seen examples where people import the entire file and then use a separate function to create another data frame that is a sample of the original; however, I am hoping for something a little less memory-intensive.
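To illustrate, that two-step approach looks roughly like this (the file name and the 1% fraction are just placeholders):

# The approach I would like to avoid: load everything, then sample rows.
full  <- read.csv("big_file.csv")                            # reads the whole 10-20 GB file into memory
keep  <- sample(nrow(full), size = round(0.01 * nrow(full)))  # pick a random 1% of the row indices
small <- full[keep, ]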

    I think you should put your data in a database. [This answer](http://stackoverflow.com/a/1820610/1412059) might be useful. – Roland Jan 16 '15 at 10:07
  • I use both a Mac (Yosemite) and a PC (Windows 7) – RMAkh Jan 16 '15 at 10:25
  • One option might be to use a unix command line tool like `awk`, there's a good discussion of that here: http://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix Once you sample with `awk`, then read into R. – Statwonk Jan 16 '15 at 10:44
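A minimal sketch of the awk suggestion above, assuming a unix-like system with awk available and using big_file.csv as a placeholder path:

# Keep each line with ~1% probability, without ever loading the full file into R.
lines <- system("awk 'BEGIN { srand() } rand() < 0.01' big_file.csv", intern = TRUE)
# Parse the sampled lines; header = FALSE because the header line is unlikely to survive the filter.
smp <- read.csv(text = paste(lines, collapse = "\n"), header = FALSE)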

1 Answer


I don't think there is a good R tool for reading a file in a random way like this (perhaps it could be added as an extension to read.table, or to fread in the data.table package).

Using Perl, you can do this easily. For example, to read a random 1% of your file's lines:

xx <- system(paste("perl -ne 'print if (rand() < .01)'", big_file), intern = TRUE)

Here I am calling Perl from R via system. xx now contains roughly 1% of your file's lines as a character vector, one element per line.
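To go from that vector to a data frame, one option (a sketch, with big_file standing for your file path) is to read the header line separately, since the random filter will almost always skip line 1, and then let read.csv parse the sampled lines from text:

# The header line survives the random filter only ~1% of the time, so read it explicitly.
header <- readLines(big_file, n = 1)
# Re-assemble header + sampled lines and parse them as CSV.
xx_df <- read.csv(text = paste(c(header, xx), collapse = "\n"))

If the header line does happen to be sampled as well, it will simply show up once as a spurious row that you can drop afterwards.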

You can wrap all this in a function:

read_partial_rand <- function(big_file, percent) {
  # Build the Perl filter: print each line with probability `percent`.
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  # Run it and return the sampled lines as a character vector.
  system(cmd, intern = TRUE)
}
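
A usage sketch (the file path is a placeholder); as above, read the header line separately before parsing:

# Sample roughly 1% of the file's lines with the wrapper above.
xx <- read_partial_rand("big_file.csv", 0.01)
# Parse to a data frame, prepending the header line read on its own.
dat <- read.csv(text = paste(c(readLines("big_file.csv", n = 1), xx), collapse = "\n"))

If you would rather keep the header inside the filter itself, a variant of the one-liner is perl -ne 'print if ($. == 1 || rand() < 0.01)' big_file.csv, where $. is Perl's current input line number.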
– agstudy