
I've recently started doing a lot more analyses with big data (big as in bigger than my RAM), so I'm finally having to pay more attention to clever memory use. I've read a few suggestions on Stack Overflow which favour matrices over data frames where possible (i.e. when you don't need to store columns as different data types), and which similarly favour data tables for efficiency. I don't have a computer science background but am trying to understand the pros and cons slightly beyond surface level.

I just did a very basic test using the iris dataset:

library(data.table)  # iris comes with R's built-in datasets package, so no library() call is needed for it
# To make it 'fair', only use first four columns of iris where all data are numeric
test1 <- iris[,1:4] 
test2 <- as.data.table(iris[,1:4])
test3 <- as.matrix(iris[,1:4])
sort(sapply(ls(),function(x){object.size(get(x))}))  # Returns memory of objects https://stackoverflow.com/questions/1395270/determining-memory-usage-of-objects
test3 test1 test2 
 5528  5920  6568 

According to this, storing the data as a matrix uses the least memory, followed by the data frame, then the data table. Very broadly, does it generally hold that the more flexibility there is in the structure you use to store your data, the more memory it needs? And presumably the efficiency of data.table comes from the speed with which the data can be processed, rather than from how compactly the data are stored?
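
For what it's worth, my rough (possibly wrong) mental model is that a matrix is a single atomic vector with a dim attribute, whereas a data frame is a list of separate column vectors (each with its own header) plus names, class and row-name attributes, and a data table adds a couple of extra attributes on top of that. A quick sketch of how I'd poke at this, run straight after the code above:

object.size(unlist(test1, use.names = FALSE))  # the raw numeric payload on its own
object.size(attributes(test3))                 # matrix: essentially just dim and dimnames
object.size(attributes(test1))                 # data frame: names, class and row.names
object.size(attributes(test2))                 # data table: the same plus extras like .internal.selfref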

Then I did another test but included all columns (the four numeric columns plus Species, which is a factor):

rm(list = ls())

test1 <- iris 
test2 <- as.data.table(iris)
test3 <- as.matrix(iris)
sort( sapply(ls(),function(x){object.size(get(x))})) 
test1 test2 test3 
 7256  7976 11128 
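
Out of curiosity I also checked what as.matrix() actually does with the Species column here (it's a factor in iris rather than character, but as far as I understand the coercion to the highest common type happens either way):

class(iris$Species)       # "factor" - not character, but it still triggers the promotion
typeof(as.matrix(iris))   # "character" - the whole matrix is one character vector
head(as.matrix(iris), 2)  # every value, including the measurements, ends up as a quoted string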

Now that the Species column is included, the matrix has to store all of the data as character, and it is by far the biggest memory user of the pack, which seems odd given what I have read. Converting all of the data to character values might help explain this:

library(dplyr)

test1 <- iris %>% mutate_all(as.character)  # mutate_all() is superseded in newer dplyr, but works for this
test2 <- iris %>% mutate_all(as.character) %>% as.data.table()
test3 <- as.matrix(iris)
sort( sapply(ls(),function(x){object.size(get(x))})) 
test3 test1 test2 
11128 14328 15048 

Forcing the data frame and the data table to store all of their data as character values as well puts the matrix back in front as the smallest object.
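
My (possibly naive) understanding of why character is so much more expensive: a numeric vector costs 8 bytes per value, while a character vector stores a pointer per element to a string held in R's global string pool, so you pay for the pointers plus the strings themselves. A one-column sketch of that idea (x is just a throwaway name):

x <- iris$Sepal.Length        # one numeric column on its own
object.size(x)                # 150 doubles, roughly 1.2 kB
object.size(as.character(x))  # the same 150 values as strings - noticeably larger
length(unique(x))             # only a few dozen distinct values, yet every element still costs a pointer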

So my take-home from this is: matrices generally store data with the least memory; data tables store data comparably to data frames but have the advantage of faster processing (note that I haven't touched on speed at all here, so that part is just based on my reading). However, the 'same' data take more memory in some formats than in others (e.g. more as character than as numeric), so if only a small proportion of your data is character it may still be best to use a data frame, because a matrix would have to coerce every column to character and its usual memory advantage could be outweighed by storing everything in a more memory-intensive type. This is also based only on a small data set - in particular, I'm not sure whether the memory behaviour of data frames and data tables diverges with larger data sets.
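
One thing I haven't tried yet is whether that per-column/attribute overhead simply becomes negligible as the data grow, which is what I'd expect if it's fixed. Something along these lines is what I have in mind (n and big are made-up names, not used above):

n <- 1e6
big <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
object.size(big)                 # ~32 MB of doubles plus a small fixed overhead
object.size(as.data.table(big))  # essentially the same payload
object.size(as.matrix(big))      # same payload again, with a dim attribute instead of column names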

Any thoughts/comments/additions would be very welcome!! I don't know if I've got the wrong end of the stick here.

  • The [arrow R package](https://arrow.apache.org/docs/r/) might be of interest. – ismirsehregal Jul 11 '22 at 12:38
  • This is a question and answer site, not a discussion forum, and it's not clear from your post what your exact question is. Everything you've said seems valid. A matrix should only be used when all your data is of the same atomic type (usually numeric). If you have mixed types it's better to use a more generic container like data frame/table to avoid unnecessary/repeated conversion between types. – MrFlick Jul 11 '22 at 12:51
