
I have created a data.table object (similar to a data.frame -- see the comments below) that occupies approximately 11MB in memory (I found its size using the object.size() function).

When I save this object to disk using the save() function, the resulting file is only 736KB.

(1) How can this be?

(2) Is it possible to achieve this small size manually, using the writeBin() function?

The data.table has 121,328 rows and 13 columns. The data types of the columns are:

  1. Date (2 columns)
  2. Character (5 columns)
  3. Integer (3 columns)
  4. Numeric (3 columns)

The first five rows of the data.table are the following:

          date     time QTind OPRAseqNum OEC OCC   Bid BidSize   Ask AskSize type expiration strike
 1: 2005-01-03 09:30:24     Q      94698   C     707.2       1 710.2       1    C 2006-06-17    500
 2: 2005-01-03 09:30:24     Q      94946   C     707.2       1 710.2       1    C 2006-06-17    500
 3: 2005-01-03 09:30:24     Q      94948   C     707.0       1 710.0       1    C 2006-06-17    500
 4: 2005-01-03 09:30:24     Q      94950   C     707.0       1 710.0       1    C 2006-06-17    500
 5: 2005-01-03 09:30:26     Q      98083   C     707.2       1 710.2       1    C 2006-06-17    500
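
For reference, a mock table of the same shape reproduces the in-memory size (the values below are invented; only the row count and column types match my data):

    library(data.table)

    # Mock data with the same shape: 121,328 rows; 2 Date,
    # 5 character, 3 integer and 3 numeric columns (values invented)
    n <- 121328
    dt <- data.table(
      date       = rep(as.Date("2005-01-03"), n),
      expiration = rep(as.Date("2006-06-17"), n),
      time       = rep("09:30:24", n),
      QTind      = rep("Q", n),
      OEC        = rep("C", n),
      OCC        = rep("C", n),
      type       = rep("C", n),
      OPRAseqNum = seq_len(n),     # integer
      BidSize    = rep(1L, n),     # integer
      AskSize    = rep(1L, n),     # integer
      Bid        = rep(707.2, n),  # numeric
      Ask        = rep(710.2, n),  # numeric
      strike     = rep(500, n)     # numeric
    )

    object.size(dt)  # roughly 11 MB in RAM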
    How does it compare with the size of an equivalent `data.frame`? I'm not sure this is a consequence of using a `data.table`, but rather the fact that R uses extra memory to represent objects (e.g. an integer occupies much more than 4 or 8 bytes in R) - see [here](http://adv-r.had.co.nz/memory.html). – nrussell Nov 24 '15 at 15:04
  • @nrussell It is not a consequence of using `data.table`. FYI the `data.table` size is `11553440 bytes` and the corresponding `data.frame` size is `11552328 bytes`. Thank you for the link. – conighion Nov 24 '15 at 15:07
  • @nrussell I'll correct the title and tags – conighion Nov 24 '15 at 15:35

1 Answer


Objects in RAM are not compressed, but files written to disk by save() are (compression is on by default). That accounts for the difference in size. As far as I know, it isn't possible to perform operations on compressed objects in R.
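
You can see the compression at work directly; a quick check, assuming your table is loaded as `dt` (file names here are arbitrary):

    # save() gzip-compresses by default; compress = FALSE writes the
    # serialized object as-is
    save(dt, file = "dt.RData")                        # a few hundred KB
    save(dt, file = "dt_raw.RData", compress = FALSE)  # ~ 12 MB
    file.size("dt.RData")
    file.size("dt_raw.RData")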

There is a manual 'solution', but you probably won't like it. You could break your data.table up into smaller chunks and write each chunk to disk compressed. Whenever you want to operate on the whole table, you decompress a chunk, run the operation, and move on to the next chunk. This would, of course, come with a noticeable performance hit, and whole-table operations (say, the mean of an entire column) take some extra bookkeeping.
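
A rough sketch of that idea, assuming your table is loaded as `dt` (the chunk count and file names are arbitrary):

    # Split the table into chunks, write each chunk compressed,
    # then drop the full table from RAM
    n_chunks <- 10
    idx   <- cut(seq_len(nrow(dt)), n_chunks, labels = FALSE)
    files <- sprintf("chunk_%02d.rds", seq_len(n_chunks))
    for (i in seq_len(n_chunks)) {
      saveRDS(dt[idx == i], files[i])  # saveRDS() compresses by default
    }
    rm(dt)

    # The mean of an entire column now takes a pass over the chunks:
    totals <- sapply(files, function(f) {
      chunk <- readRDS(f)              # decompress one chunk at a time
      c(sum(chunk$Bid), nrow(chunk))
    })
    sum(totals[1, ]) / sum(totals[2, ])  # mean of Bid across all rows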

Alternatively, and a bit more flexibly, if you are more often interested in retrieving some columns rather than some rows, a columnar store is a better fit: take a look at the saves package on CRAN (although its author considers it experimental), or at some other disk-backed columnar data store.
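
If you'd rather not pull in a package, the same idea can be hand-rolled in a few lines of base R (directory and file names invented; assumes `dt` is in memory again):

    # Write each column to its own compressed file...
    dir.create("dt_cols", showWarnings = FALSE)
    for (col in names(dt)) {
      saveRDS(dt[[col]], file.path("dt_cols", paste0(col, ".rds")))
    }

    # ...then read back only the columns you actually need
    bid <- readRDS(file.path("dt_cols", "Bid.rds"))
    mean(bid)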

However, both alternatives ultimately put an uncompressed table in RAM at one point or another; they just reduce how much of the table you have to bring in at once.

  • So if I understand correctly I can't save it as a compressed file using R but when R is saving it to the disk (using the `save()` function) it compresses it automatically? – conighion Nov 24 '15 at 15:03
  • @conighion More or less correct. When the data.table is an object in the R namespace (in RAM), it isn't a 'file'. When you save it to disk it becomes a 'file' and by default that is a compressed file. See ?save under the compress option to see how it is handling the compression. – russellpierce Nov 24 '15 at 15:06
  • Thank you. I used the command `save(object, file = "Filename.dat", compress = F)` and the size of the file became approximately 12MB. – conighion Nov 24 '15 at 15:34