
I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.

Is this possible? I cannot seem to find an answer anywhere.

Anton
  • See example 6e on the sqldf home page: http://sqldf.googlecode.com for a one line solution. – G. Grothendieck Nov 24 '13 at 14:08
  • @G.Grothendieck: Judging from my task manager the whole file is loaded into RAM – Raffael Nov 24 '13 at 14:40
  • Yeah, I've been thinking about how it could work in R. What I want is to tell R to selective read.csv certain rows of a csv file. – Anton Nov 24 '13 at 15:07
  • have a look here (sampling from a large text file): http://stackoverflow.com/questions/15532810/reading-40-gb-csv-file-into-r-using-bigmemory – Ido Tamir Nov 24 '13 at 15:25
  • @Яaffael, sqldf uses RAM by default but with the argument `dbname = tempfile()` it will set up, use and destroy a temporary database and only the sampled rows will ever be read into R. – G. Grothendieck Nov 24 '13 at 15:53
  • @G.Grothendieck: Just saw that you are the maintainer of sqldf - great package! I am using it all the time. – Raffael Nov 24 '13 at 17:45
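
A rough sketch of the sqldf one-liner mentioned in the comments above (the file name data.csv is a placeholder; example 6e on the sqldf home page is the authoritative version):

library(sqldf)
# read.csv.sql streams the csv into a temporary SQLite database (dbname = tempfile()),
# runs the sampling query there, and returns only the 1000 sampled rows to R.
# Inside the SQL, the csv file is referred to as the table "file".
DF <- read.csv.sql("data.csv",
                   sql = "select * from file order by random() limit 1000",
                   dbname = tempfile())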

1 Answer


If you don't want to pay thousands of dollars for Revolution R so that you can load and analyze your data in one go, then sooner or later you will need to figure out a way to sample your data.

And that step is easier to do outside R.

(1) Linux Shell:

Assuming your data has a consistent format, with one record per row, you can do:

sort -R data | head -n 1000 >data.sample

This will randomly sort all the rows and write the first 1000 rows into a separate file, data.sample.
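
If the file has a header row, note that sort -R shuffles the header in with the data. A minimal R sketch for reading the sample back in, assuming the file names above and a plain comma-separated layout:

hdr <- read.csv("data", nrows = 1)             # cheap: reads only the header plus one row
smp <- read.csv("data.sample", header = FALSE) # the shuffled sample has no header of its own
names(smp) <- names(hdr)
# Caveat: the original header line may itself end up among the sampled rows; drop it if so.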

(2) If the data does not fit into memory:

Another solution is to store the data in a database. For example, I have many tables stored in a MySQL database in a beautiful tabular format, and I can draw a sample with:

select * from tablename order by rand() limit 1000

You can easily communicate between MySQL and R using the RMySQL package, and you can index your columns to keep the query fast. If you want to use the power of the database, you can also have it compute the mean or standard deviation of the whole dataset to compare against your sample.
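
A minimal RMySQL sketch of that workflow (connection details and table name are placeholders):

library(RMySQL)  # loads DBI as well
con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")
smp <- dbGetQuery(con, "select * from tablename order by rand() limit 1000")
dbDisconnect(con)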

Based on my experience, these are the two most commonly used ways of dealing with 'big' data.

B.Mr.W.
  • +1 for the overall idea, but I don't know what version of `sort` you use that has a `-R` option. – flodel Nov 24 '13 at 17:24
  • @flodel http://stackoverflow.com/questions/886237/how-can-i-randomize-the-lines-in-a-file-using-a-standard-tools-on-redhat-linux – B.Mr.W. Nov 24 '13 at 17:40
  • I know it is not R code, this is what I am referring to: http://linux.about.com/library/cmd/blcmdl1_sort.htm which doesn't have a -R option. But the link @B.Mr.W. gave has various options that should work for different OS. – flodel Nov 24 '13 at 18:00