
I want to manipulate, store and retrieve nested data in R, but to my surprise the nested data frame is substantially larger than the flat one:

pacman::p_load(dplyr, tidytable)

test3 <- tibble(ID = 1:1e5) %>% 
  group_by(ID) %>% 
  summarise(number = 1:sample(1:4, size = 1), .groups = "drop") %>% 
  mutate(Date = sample(seq.Date(from = as.Date("2021-01-01"),
                                 to = as.Date("2021-12-31"), by = 1),
                        size = n(), replace = TRUE)) 

test4 <- test3 %>% nest_by(ID)

prettyNum(object.size(test3), big.mark = ",") # 4 MB
prettyNum(object.size(test4), big.mark = ",") # 132 MB

The same issue exists with tidytable.

Nesting is an appealing idea because it helps avoid data duplication when the data are not strictly two-dimensional.

But that memory increase is problematic.

Furthermore, write_fst refuses to write data if there are nested columns, so I may need a different solution here as well.

Do you have any suggestions?

mzuba
Where do you need to be able to read the data after you've written it? Saving in R's own .rds format is fairly efficient as long as you don't need to open it outside of R. There are also formats like feather that are cross-compatible with Python. – camille Jan 27 '22 at 15:09

1 Answer


The simple answer is: don't nest your data.

A numeric vector can be stored and retrieved efficiently because its values sit next to each other in memory. Nested data, by contrast, are scattered across memory, and the data frame has to hold the address of each nested element in order to reach its values. In other words, a numeric vector is a single object, whereas a nested column is a collection of many small objects (10^5 in your case), each of which carries its own overhead (names, class, row names).
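To make the difference concrete, here is a minimal base R sketch, not your exact pipeline: it stores the same random values once as a single numeric vector and once as a list of one-row data frames, then compares what `object.size()` reports. The names `values` and `pieces` and the size `n = 1e4` are only illustrative assumptions, and the exact byte counts will depend on your R version and platform.

```r
# Illustrative sketch (assumed names and sizes, base R only).
set.seed(1)
n <- 1e4  # smaller than the 1e5 IDs in the question, to keep the demo quick

# Flat layout: all values live in one contiguous numeric vector.
values <- runif(n)

# Nested layout: the same values, but each wrapped in its own one-row data frame,
# similar in spirit to the list column that nest_by() creates.
pieces <- lapply(values, function(v) data.frame(value = v))

flat_bytes   <- as.numeric(object.size(values))
nested_bytes <- as.numeric(object.size(pieces))

# Size of a single nested element: mostly attributes, very little data.
per_element <- as.numeric(object.size(pieces[[1]]))

cat("flat vector:      ", format(flat_bytes, big.mark = ","), "bytes\n")
cat("list of pieces:   ", format(nested_bytes, big.mark = ","), "bytes\n")
cat("one nested piece: ", format(per_element, big.mark = ","), "bytes\n")
cat("ratio nested/flat:", round(nested_bytes / flat_bytes, 1), "\n")
```

If the ratio looks similar to what you see between test3 and test4, the overhead is coming from the many small objects rather than from the data itself, so keeping the table flat (with ID as a key column) and only nesting or splitting on demand should keep the memory footprint close to the flat size.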

barpapapa