
Is it possible to draw bootstrap samples from an .Rdata object, or any other large data object stored on disk? My current approach for sampling from very large data is to build a local MySQL database, and then draw random samples into R using SQL. Unfortunately, sampling and sorting in MySQL is not efficient at all. I'm wondering if anyone has engineered a better solution for this use case.

To get a sense of my current solution, see this question on sampling in MySQL: Simple Random Samples from a Sql database

Ben Rollert
  • You could use the ff package, use a file-based object and sample from that. Ditto via the bigmemory package which also allows file-backed objects. – Dirk Eddelbuettel Aug 07 '14 at 02:08

1 Answer


General remarks

You don't have to load all of the data to sample from it: load just the ids of the rows, sample from those ids, and then load the data for the sampled rows only. In more detail (a short R sketch follows below):

  1. E.g. if you have a column called ID in your database, then load only this column. This should be fast, especially if ID is a plain integer. Even if you have (say) 2 billion records, storing 2 billion 4-byte integers takes only about 8 GB of memory, so this should be feasible.

  2. Then sample from these ids.

  3. Then load only the records with the sampled ids.

Just think about it. If you want to buy three random books from the online bookstore Siren, what do you do? Siren has millions of books; you can't order all of them, choose randomly between them, and send back the rest, right? So what you do is ask Siren for a list of book ids (ISBN will do), which is of a size that fits on a modest computer. You choose three from the list at random and order those three from Siren.
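
A minimal sketch of the three steps in plain R; the names ids and samp_ids are just illustrative, and a small in-memory vector stands in for the real id column:

## 1. In practice you would load only the id column from wherever the data
##    lives; here a small in-memory vector stands in for it.
ids <- 1:1000

## 2. Draw a bootstrap sample of the ids, i.e. sample with replacement.
samp_ids <- data.frame(id = sample(ids, size = length(ids), replace = TRUE))

## 3. Load only the records whose ids are in samp_ids$id -- this step is
##    backend-specific, see the sqlite and MySQL sections below.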

Rdata files

This obviously does not work for .Rdata files, but .Rdata files are hopeless for this anyway, because you cannot load just a piece of an .Rdata file. So you would need some format that is indexed.

sqlite

But you could use sqlite and the RSQLite package. RSQLite supports binding data frames to queries, so to load the sampled data all you need to do is put the sampled ids in a data frame called samp_ids (with a single column named id), and then say something like

...
my_samp <- dbGetQuery(con, "SELECT * FROM mytable WHERE id = :id", 
                      bind.data = samp_ids)
...

This will read the records with the sampled ids.
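
For a fuller picture, here is a hedged end-to-end sketch; the file name mydata.sqlite and the table/column names mytable / id are assumptions, and it uses the params argument of dbGetQuery from more recent DBI/RSQLite versions rather than bind.data:

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")

## 1. Load only the id column.
ids <- dbGetQuery(con, "SELECT id FROM mytable")$id

## 2. Bootstrap sample of the ids (with replacement).
samp_ids <- data.frame(id = sample(ids, size = length(ids), replace = TRUE))

## 3. Load only the sampled records; the query is executed once per bound id
##    and the results are combined.
my_samp <- dbGetQuery(con, "SELECT * FROM mytable WHERE id = ?",
                      params = list(samp_ids$id))

dbDisconnect(con)

Note that binding runs the query once per sampled id, so ids drawn more than once come back as duplicate rows, which is what you want for a bootstrap sample (an id IN (...) clause would silently drop the duplicates).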

MySQL

Last time I checked, RMySQL did not support parameter binding, so that is probably not the way to go. If you insist on MySQL, then the first thing I would try is creating a temporary table with only the sampled row ids in it, and then JOINing this table with your data table. This is supported by RMySQL; it works like this:

dbSendQuery(con, "CREATE TEMPORARY TABLE tmp (id INTEGER);")

## Stupid, but this seems to be necessary to really create the
## temporary table; it is an RMySQL bug that I reported a long time
## ago: https://github.com/jeffreyhorner/RMySQL/issues/10
try(dbSendQuery(con, "CREATE TEMPORARY TABLE tmp (id INTEGER);"))

dbWriteTable(con, "tmp", samp_ids, row.names=FALSE, append=TRUE)

Then you can write your SELECT to JOIN the temporary table tmp with your original data table.
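
A sketch of that SELECT (mytable is an assumed table name); joining against tmp also preserves duplicates, so ids sampled more than once appear more than once in the result:

my_samp <- dbGetQuery(con,
  "SELECT d.* FROM mytable AS d JOIN tmp AS t ON d.id = t.id;")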

Gabor Csardi
  • This feels like a comment. – Brandon Bertelsen Aug 07 '14 at 02:31
  • 2
    Would you like to expand on how one would use the row ids to sample from an object that is not stored in memory? – thelatemail Aug 07 '14 at 02:35
  • Load the row ids only. E.g. if it is an SQL table then the SQL row ids. If the object does not have row ids, then assign ids to the rows. – Gabor Csardi Aug 07 '14 at 02:59
  • 1
    Not at all, just provide an example of what you mean. I think @thelatemail is interested as well. – Brandon Bertelsen Aug 07 '14 at 03:04
  • @BrandonBertelsen: better? – Gabor Csardi Aug 07 '14 at 03:28
    Yes, I've thought of this solution as well. I guess a related question is how to write the most efficient WHERE clause in SQL, i.e. do you use id IN, do you JOIN to the vector, etc. Of course, that's a lower-level issue that can be resolved separately. – Ben Rollert Aug 07 '14 at 03:44
  • @BenRollert better? :) – Gabor Csardi Aug 07 '14 at 04:07
  • @BenRollert: well, not really a low level issue, because this will probably be your new bottleneck, so you want to do this efficiently, otherwise my solution is useless. I have added some tips on how to do it. – Gabor Csardi Aug 07 '14 at 04:18
  • Awesome. I'm wondering if the RSQLite solution could be multithreaded, that is, could you load a sample (plus fit a model, process the data, etc.) on each iteration of %dopar% using multiple cores on a single machine. If so, that would seem like a winner. – Ben Rollert Aug 07 '14 at 16:51
  • Depends on the backend you use for `%dopar%`, I guess. sqlite itself is thread-safe, so unless `RSQLite` is doing something nasty, you should be able to use the multicore backend. Based on http://www.sqlite.org/faq.html#q6 you will probably need a separate sqlite connection for each thread. – Gabor Csardi Aug 07 '14 at 18:31
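
A minimal sketch of the one-connection-per-worker pattern discussed in the last two comments, assuming the doParallel backend for %dopar% and reusing the assumed mydata.sqlite / mytable names from above:

library(foreach)
library(doParallel)
library(DBI)
library(RSQLite)

registerDoParallel(cores = 4)

## Load the ids once in the main process.
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")
ids <- dbGetQuery(con, "SELECT id FROM mytable")$id
dbDisconnect(con)

## Each worker opens its own sqlite connection, draws its own bootstrap
## sample and processes it; nrow() stands in for the real model fit.
results <- foreach(b = 1:100, .packages = c("DBI", "RSQLite")) %dopar% {
  wcon <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")
  samp <- dbGetQuery(wcon, "SELECT * FROM mytable WHERE id = ?",
                     params = list(sample(ids, length(ids), replace = TRUE)))
  dbDisconnect(wcon)
  nrow(samp)
}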