14

I am writing a package to analyse high-throughput animal behaviour data in R. The data are multivariate time series. I have chosen to represent them using data.tables, which I find very convenient.

For one animal, I would have something like this:

library(data.table)

one_animal_dt <- data.table(t=1:20, x=rnorm(20), y=rnorm(20))

However, my users and I work with many animals having different arbitrary treatments, conditions and other variables that are constant within each animal.

In the end, the most convenient way I found to represent the data was to merge the behaviour from all the animals and all the experiments into a single data table, and to use extra columns, which I set as the key, for each of these "repeated variables".

So, conceptually, something like this:

animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                               treatment="A", date="2017-02-21 20:00:00", 
                               animal_id=1)
animal_list[[2]]  <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                                treatment="B", date="2017-02-21 22:00:00",
                                animal_id=2)
# ...
final_dt <- rbindlist(animal_list)
setkeyv(final_dt,c("treatment", "date","animal_id"))

This makes it very convenient to compute summaries per animal whilst remaining agnostic about the biological information (treatments and so on).
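For example (a quick sketch on the toy final_dt above), a per-animal summary needs no knowledge of which metadata columns exist, because they are all part of the key:

final_dt[, .(mean_x = mean(x), mean_y = mean(y)), by = key(final_dt)]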

In practice, we have millions of (rather than 20) consecutive reads for each animal, so the columns we added for convenience contain highly repeated values, which is not memory efficient.

Is there a way to compress this highly redundant key without losing the structure (i.e. the columns) of the table? Ideally, I don't want to force my users to use JOINs themselves.

Quentin Geissmann
  • you can also use the `ff` package and `bigmemory` to handle your data, as described [here](https://rpubs.com/msundar/large_data_analysis) – parth Aug 01 '17 at 12:22
  • The correct answer to your question depends on the sizing of your problem. You have already mentioned that there are millions of consecutive reads of longitudinal data for each animal. But, how many columns of longitudinal data, e.g., `t`, `x`, `y` are being recorded? How many constant variables are stored per animal? How many animals are being observed? Thank you. – Uwe Aug 03 '17 at 07:01
  • @UweBlock It really depends. In general, we will have between five and ten variables, and sometimes up to 20 "metavariables". I try to encourage users to put as many metavariables as they can record in the lab (age, sex, location, lifespan,... of the animal). The idea is really, to remain agnostic to what type and number of metavariables we have. – Quentin Geissmann Aug 03 '17 at 14:14
  • @QuentinGeissmann You're confirming my gut feeling that there might be many more constant variables or "metavariables" than just two. IMHO, this is a strong argument for keeping longitudinal data and "metavariables" separate even if this is less comfortable for the users. – Uwe Aug 03 '17 at 14:22
  • I know :(. To make it less uncomfortable, I made [behavr](https://github.com/rethomics/behavr). Hopefully, it helps them. Thanks so much for the feedback! – Quentin Geissmann Aug 03 '17 at 14:30

4 Answers

5

Let's assume we are a database administrator given the task of implementing this efficiently in a SQL database. One of the goals of database normalisation is the reduction of redundancy.

According to the OP's description, there are many (about 1 M) observations per animal (multivariate, longitudinal data), while the number of animals seems to be much smaller.

So, the constant (or invariant) base data of each animal, e.g., treatment, date, should be kept separately from the observations.

animal_id is the key into both tables assuming animal_id is unique (as the name suggests).

(Note that this is the main difference to Mallick's answer who uses treatment as key which is not guaranteed to be unique, i.e., two animals may receive the same treatment, and furthermore increases redundancy.)

Separate tables are memory efficient

For the purpose of demonstration, more realistic "benchmark" data are being created for 10 animals with 1 M observations for each animal:

library(data.table)   # CRAN version 1.10.4 used
# create observations
n_obs <- 1E6L
n_animals <- 10L
set.seed(123L)
observations <- data.table(
  animal_id = rep(seq_len(n_animals), each = n_obs),
  t = rep(seq_len(n_obs), n_animals),
  x = rnorm(n_animals * n_obs), 
  y = rnorm(n_animals * n_obs))
# create animal base data
animals <- data.table(
  animal_id = seq_len(n_animals),
  treatment = wakefield::string(n_animals),
  date = wakefield::date_stamp(n_animals, random = TRUE))

Here the wakefield package is used to create dummy names and dates. Note that animal_id is of type integer.

> str(observations)
Classes ‘data.table’ and 'data.frame':    10000000 obs. of  4 variables:
 $ animal_id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ t        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ x        : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
 $ y        : num  0.696 -0.537 -3.043 1.849 -1.085 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> str(animals)
Classes ‘data.table’ and 'data.frame':    10 obs. of  3 variables:
 $ animal_id: int  1 2 3 4 5 6 7 8 9 10
 $ treatment:Classes 'variable', 'character'  atomic [1:10] MADxZ9c6fN ymoJHnvrRx ifdtywJ4jU Q7ZRwnQCsU ...
  .. ..- attr(*, "varname")= chr "String"
 $ date     : variable, format: "2017-07-02" "2016-10-02" ...
 - attr(*, ".internal.selfref")=<externalptr>

The combined size is about 240 Mbytes:

> object.size(observations)
240001568 bytes
> object.size(animals)
3280 bytes

Let's take this as a reference and compare it with the OP's approach, final_dt:

# join both tables to create equivalent of final_dt
joined <- animals[observations, on = "animal_id"]

The size has now grown to 400 Mbytes, two thirds more than the separate tables, which is not memory efficient.

> object.size(joined)
400003432 bytes

Note that no data.table key has been set so far; instead, the on parameter was used to specify the column to join on. If we set the key, joins will be sped up and the on parameter can be omitted:

setkey(observations, animal_id)
setkey(animals, animal_id)
joined <- animals[observations] 

How to work with separate tables?

Now, we have demonstrated that it is memory efficient to use two separate tables.

For subsequent analysis, we can aggregate the observations per animal, e.g.,

observations[, .(.N, mean(x), mean(y)), by = animal_id]
    animal_id       N            V2            V3
 1:         1 1000000 -5.214370e-04 -0.0019643145
 2:         2 1000000 -1.555513e-03  0.0002489457
 3:         3 1000000  1.541233e-06 -0.0005317967
 4:         4 1000000  1.775802e-04  0.0016212182
 5:         5 1000000 -9.026074e-04  0.0015266330
 6:         6 1000000 -1.000892e-03  0.0003284044
 7:         7 1000000  1.770055e-04 -0.0018654386
 8:         8 1000000  1.919562e-03  0.0008605261
 9:         9 1000000  1.175696e-03  0.0005042170
10:        10 1000000  1.681614e-03  0.0020562628

and join the aggregates with animals

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]]
    animal_id  treatment       date       N            V2            V3
 1:         1 MADxZ9c6fN 2017-07-02 1000000 -5.214370e-04 -0.0019643145
 2:         2 ymoJHnvrRx 2016-10-02 1000000 -1.555513e-03  0.0002489457
 3:         3 ifdtywJ4jU 2016-10-02 1000000  1.541233e-06 -0.0005317967
 4:         4 Q7ZRwnQCsU 2017-02-02 1000000  1.775802e-04  0.0016212182
 5:         5 H2M4V9Dfxz 2017-04-02 1000000 -9.026074e-04  0.0015266330
 6:         6 29P3hFxqNY 2017-03-02 1000000 -1.000892e-03  0.0003284044
 7:         7 rBxjewyGML 2017-02-02 1000000  1.770055e-04 -0.0018654386
 8:         8 gQP8cZhcTT 2017-04-02 1000000  1.919562e-03  0.0008605261
 9:         9 0GEOseSshh 2017-07-02 1000000  1.175696e-03  0.0005042170
10:        10 x74yDs2MdT 2017-02-02 1000000  1.681614e-03  0.0020562628

The OP has pointed out that he doesn't want to force his users to use joins themselves. Admittedly, typing animals[observations] takes more keystrokes than final_dt. So, it's up to the OP to decide whether the memory savings are worth the extra keystrokes.

This result can be filtered, for instance, if we want to compare animals with certain characteristics, e.g.,

animals[observations[, .(.N, mean(x), mean(y)), by = animal_id]][date == as.Date("2017-07-02")]
   animal_id  treatment       date       N           V2           V3
1:         1 MADxZ9c6fN 2017-07-02 1000000 -0.000521437 -0.001964315
2:         9 0GEOseSshh 2017-07-02 1000000  0.001175696  0.000504217

OP's use cases

In a comment, the OP has described some use cases which he wants to see implemented transparently for his users:

  • Creation of new columns final_dt[, x2 := 1-x]: As only observations are involved, this translates directly to observations[, x2 := 1-x].
  • Select using various criteria final_dt[t > 5 & treatment == "A"]: Here columns of both tables are involved. This can be implemented with data.table in different ways (note that the conditions have been amended for the actual sample data):

    animals[observations][t < 5L & treatment %like% "MAD"]
    

    This is analogous to the expected syntax but is slower than the alternative below, because here the filter conditions are applied to all rows of the full join.

    The faster alternative is to split up the filter conditions: observations is filtered before the join to reduce the result set, and the filter conditions on the base data columns are applied afterwards:

    animals[observations[t < 5L]][treatment %like% "MAD"]
    

    Note that this looks quite similar to the expected syntax (with one keystroke less).

    If this is deemed unacceptable by the users, the join operation can be hidden in a function:

    # function definition
    filter_dt <- function(ani_filter = "", obs_filter = "") {
      eval(parse(text = stringr::str_interp(
        'animals[observations[${obs_filter}]][${ani_filter}]')))
    }
    
    # called by user
    filter_dt("treatment %like% 'MAD'", "t < 5L")
    
       animal_id  treatment       date t           x          y
    1:         1 MADxZ9c6fN 2017-07-02 1 -0.56047565  0.6958622
    2:         1 MADxZ9c6fN 2017-07-02 2 -0.23017749 -0.5373377
    3:         1 MADxZ9c6fN 2017-07-02 3  1.55870831 -3.0425688
    4:         1 MADxZ9c6fN 2017-07-02 4  0.07050839  1.8488057
    

Using factors to reduce memory footprint

Caveat: Your mileage may vary, as the conclusions below depend on the internal representation of integers on your computer and on the cardinality of the data. Please see Matt Dowle's excellent answer concerning this subject.

Mallick has mentioned that memory might get wasted if integers are inadvertently stored as numerics. This can be demonstrated:

n <- 10000L
# integer vs numeric vs logical
test_obj_size <- data.table(
  rep(1, n),
  rep(1L, n),
  rep(TRUE, n))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  3 variables:
 $ V1: num  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V3: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3 
80040 40040 40040

Note that the numeric vector needs twice as much memory as the integer vector. Therefore, it is good programming practice to always qualify an integer constant with the suffix character L.

Also the memory consumption of character strings can be reduced if they are coerced to factor:

# character vs factor
test_obj_size <- data.table(
  rep("A", n),
  rep("AAAAAAAAAAA", n),
  rep_len(LETTERS, n),
  factor(rep("A", n)),
  factor(rep("AAAAAAAAAAA", n)),
  factor(rep_len(LETTERS, n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  6 variables:
 $ V1: chr  "A" "A" "A" "A" ...
 $ V2: chr  "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" "AAAAAAAAAAA" ...
 $ V3: chr  "A" "B" "C" "D" ...
 $ V4: Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
 $ V5: Factor w/ 1 level "AAAAAAAAAAA": 1 1 1 1 1 1 1 1 1 1 ...
 $ V6: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4    V5    V6 
80088 80096 81288 40456 40464 41856

Stored as factor, only half of the memory is required.

The same holds for Date and POSIXct classes:

# Date & POSIXct vs factor
test_obj_size <- data.table(
  rep(as.Date(Sys.time()), n),
  rep(as.POSIXct(Sys.time()), n),
  factor(rep(as.Date(Sys.time()), n)),
  factor(rep(as.POSIXct(Sys.time()), n)))

str(test_obj_size)
Classes ‘data.table’ and 'data.frame':    10000 obs. of  4 variables:
 $ V1: Date, format: "2017-08-02" "2017-08-02" "2017-08-02" "2017-08-02" ...
 $ V2: POSIXct, format: "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" "2017-08-02 18:25:55" ...
 $ V3: Factor w/ 1 level "2017-08-02": 1 1 1 1 1 1 1 1 1 1 ...
 $ V4: Factor w/ 1 level "2017-08-02 18:25:55": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>
sapply(test_obj_size, object.size)
   V1    V2    V3    V4 
80248 80304 40464 40480

Note that data.table() refuses to create a column of class POSIXlt as it is stored in 40 bytes instead of 8 bytes.

So, if your application is memory critical, it might be worthwhile to consider using factor where applicable.
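For instance, a rough sketch using the joined table from above (if the single-table layout is kept): converting the repeated character and date columns to factor in place should roughly halve their footprint, in line with the measurements above.

joined[, c("treatment", "date") := .(factor(treatment), factor(date))]
object.size(joined)  # treatment and date are now stored as 4-byte integer codes plus a small level table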

Uwe
  • thanks for your comprehensive answer! I think you are right, meta and data should be kept separate. I was just hoping there would be a way to have some sort of *implicit* look-up table for the metadata. My experience with my users (neurobiologists) is that keeping two tables is error prone (e.g. they may swap/corrupt {meta}data) and it is tricky for them to join... So I wanted to provide them with something more constraining and high level. Since my question, I have made this small package: https://github.com/rethomics/behavr which I hope solves the issue in our case. Feedback welcome! – Quentin Geissmann Aug 03 '17 at 14:28
  • Wow, cool stuff! At a quick glance it reminds me of Chi Pak's nested data.frame approach, right? This is probably the right approach if the users are working mainly with one animal at a time. – Uwe Aug 03 '17 at 14:46
  • It is basically two tables, just like you said. All the data, with a unique `id` column, and the metadata with the same `id` as key (and all the metavariables). Now, all the work was to pack it so that the metadata is stored as an attribute within a class `behavr` that derives from `data.table`. This way, we can ensure that the users keep them together, that they always have the same key, that binding several data also binds metadata, etc. I hope it makes sense! – Quentin Geissmann Aug 03 '17 at 15:32
  • Oh, I see. I had a second look at your package. So, you store the metadata as an attribute of a data.table? Very clever. Won't you post this as an answer? You'll get my upvotes on the spot. I was only surprised by the many costly `copy()` operations, e.g., in `behavr()`. `data.table` tries to change in place, i.e., without copying the whole object, to save memory and time. What about `setBehavr(data, meta)` as an equivalent to `setDT()`? – Uwe Aug 03 '17 at 16:08
  • Thanks @UweBlock! I think we change in place *everything related to data*. It was difficult to change metadata in place, so I systematically copy metadata. Metadata is *at most* a few thousand rows, so it remains very cheap. I will post my answer, thanks! – Quentin Geissmann Aug 03 '17 at 17:45
  • Maybe I misunderstood. I thought you meant that things like `:=` now copy data. You mean saving time and memory when constructing a `behavr` object, I guess. Yes, I suppose it could be useful in some cases! I should implement a `setBehavr`! – Quentin Geissmann Aug 03 '17 at 18:38
3

You should consider using a nested data.frame

library(tidyverse)

Using a toy example where I rbind 4 copies of mtcars

new <- rbind(mtcars,mtcars,mtcars,mtcars) %>% 
         select(cyl,mpg)
object.size(new)
11384 bytes

If we group the data, which you might do for summarizing values, the size increases a bit

grp <- rbind(mtcars,mtcars,mtcars,mtcars)%>% 
         select(cyl,mpg) %>% 
         group_by(cyl)
object.size(grp)    
14272 bytes

If we nest the data as well

alt <- rbind(mtcars,mtcars,mtcars,mtcars) %>% 
         select(cyl,mpg) %>% 
         group_by(cyl) %>% 
         nest(mpg)
object.size(alt)
4360 bytes

You get a significant reduction in the object size.

NOTE: You must have many repeating values to save memory in this case; for instance, a nested single copy of mtcars is larger in memory than a single normal copy of mtcars.

-----YOUR CASE-----

alt1 <- final_dt %>%
         group_by(animal_id, treatment, date) %>%
         nest()

would look like

alt1
  animal_id treatment                date              data
1         1         A 2017-02-21 20:00:00 <tibble [20 x 3]>
2         2         B 2017-02-21 22:00:00 <tibble [20 x 3]>
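As a rough sketch (assuming the nested alt1 above, with purrr attached via the tidyverse), per-animal summaries can be computed directly on the list column without unnesting:

alt1 %>%
  mutate(mean_x = map_dbl(data, ~ mean(.x$x)),
         mean_y = map_dbl(data, ~ mean(.x$y)))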
CPak
  • Many thanks, it is a very interesting structure! At the moment, my users can transparently transform variables (`final_dt[, x2 := 1-x]`) and select using various criteria (`final_dt[t > 5 & treatment == "A"]`). I fear that these conceptually simple operations will become a syntactic burden with a nested format. Ideally, I would have some sort of transparent, implicit nesting... Also, out of curiosity, how would you make a new variable, say `x+y`, without having to, even temporarily, unnest the whole table? – Quentin Geissmann Jul 20 '17 at 09:04
  • You can operate on the nested data using something like `%>% mutate(data = map(data, ~ ...))`. You can see more examples in the `purrr::map` documentation – CPak Jul 20 '17 at 12:42
  • @ChiPak Nice answer. But he is interested in performance and is asking about data.table, so tidyverse solutions seem off topic. In addition, making his package depend on every package in the tidyverse will add a lot of useless bloat. – Justin Jul 27 '17 at 21:38
  • @ChiPak, I suggest using the [`parallel`](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf) package to improve performance here, somewhat like [this example](http://www.business-science.io/code-tools/2016/12/18/multidplyr.html) – parth Aug 01 '17 at 12:24
  • @parth, I'm unsure what `performance` the commenters are referring to (but it seems that `performance` is being mentioned a lot), but `parallel` will definitely not help with memory, especially not `multidplyr` (which still holds R objects in memory). There are other packages like `bigmemory` that help with memory usage (using file-backing), but they are somewhat clunky to operate on at the data level. OP did not ask for help with computation time... – CPak Aug 01 '17 at 13:35
  • Since I've been receiving comments regarding `performance`; @Justin, `tidyverse` is compatible with all `data.table` objects, and `purrr` can replace `tidyverse` if package bloat is a concern. Regarding `performance`, I've demonstrated a reduction in the size of R objects with highly repetitive values. I believe it's a relevant solution. – CPak Aug 01 '17 at 13:40
  • @ChiPak Whilst I like the idea of having nested data, my users will often need to compute or alter variables according to metavariables. In this framework, if I want to, for instance, do `dt[, t := t+animal_id]`, I think I would do something like `alt2 <- alt1 %>% mutate(data=map2(data, animal_id, ~ mutate(.x, t=t + .y)))`. For my users, this is a lot to write (and to get wrong). For instance, we already have position variables called `x` and `y`, so using `.x` and `.y` can be confusing. In addition, I believe all the data is copied twice in the process. Not sure this is the best way though. – Quentin Geissmann Aug 03 '17 at 14:20
3

Thanks a lot for all your feedback. You encouraged me to answer my own question, so here it is.

I have considered three different data structures:

  • The original one where all the metadata are naively in the same table. See my question.
  • The nested one, where a table contains the metadata and a special data column which contains one table per animal. See the answer by @ChiPak.
  • Two tables. One for metadata and one for data. Those two tables can be mapped onto each other using a shared id (key). See the answer by @UweBlock.

The original approach is very convenient. It is efficient and simple to write operations involving data and metadata alike (as they are in the same table): for instance, creating or altering metadata or data can be done efficiently with :=.

I investigated the nested approach and found it elegant, but I was not satisfied with the difficulty and error-proneness of writing statements to perform simple operations such as creating a variable according to the value of a metavariable (see my comment).

I also considered the two-table option very seriously. It is very efficient if the users know how to perform joins (which are quite verbose) and if they can keep track of the relationship between data and metadata (e.g. if you have several datasets, you need to ensure you have the right metadata for the right data). Ideally, metadata and data should live in the same structure, just like nested tables are "inside" their unique parent table.

In the end, I tried to take a bit of all three approaches and came up with a new data structure that I have put in a package called behavr. The data are internally stored in an instance of a class that derives from data.table, and the metadata are attached as an attribute. Data and metadata share the same key. For data, it works as a regular data table, e.g. dt[, y2 := y + 1]. Since the metadata live inside the same structure, I could write a function (xmd) to "expand metadata", i.e. implicitly join it. For instance, dt[, y3 := xmd(a_meta_variable) + y] looks up a_meta_variable in the metadata and joins it so that it can be used as a regular -- anonymous -- vector. In addition, I modified the [ operator a bit so we can access the metadata with [..., meta=T]: dt[meta=T] to see the metadata, dt[, new_meta_var := meta_var1 + meta_var2, meta=T] to make a new metavariable, and dt[id < 10, meta=T] to subset it.
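Putting this together, a minimal sketch of the intended usage (assuming a behavr(data, metadata) constructor as discussed in the comments above; the variable and metavariable names are placeholders):

library(behavr)                          # draft package, github.com/rethomics/behavr
dt <- behavr(data, metadata)             # data and metadata keyed by the same id
dt[, y2 := y + 1]                        # behaves like a regular data.table
dt[, y3 := xmd(a_meta_variable) + y]     # xmd() implicitly joins a metavariable
dt[meta = TRUE]                          # view the metadata
dt[, new_meta_var := meta_var1 + meta_var2, meta = TRUE]  # create a new metavariable
dt[id < 10, meta = TRUE]                 # subset the metadata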

It really is a draft package at the moment, so I would be very happy to have some feedback and contributions! Read more at https://github.com/rethomics/behavr

Quentin Geissmann
2

Here are two possibilities (use one, both, or none):

  1. Ensure that all column types are the most memory efficient possible. If you have integers stored as numerics, that can eat up a lot of memory.
  2. Since you don't want your users to have to do joins on their own, write a short function that does the join for them depending on the animals they want. Just put the animal information in one data.table and the treatment information in another data.table, and do the merge in the function.

First I'll just separate the tables:

# Same code as in question to generate data
animal_list <- list()
animal_list[[1]] <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                               treatment="A", date="2017-02-21 20:00:00", 
                               animal_id=1)
animal_list[[2]]  <- data.table(t=1:20, x=rnorm(20), y=rnorm(20),
                                treatment="B", date="2017-02-21 22:00:00",
                                animal_id=2)
# ...
final_dt <- rbindlist(animal_list)

# Separating into treatment and animal data.tables
animals_dt <- unique(final_dt[, .(date), keyby = animal_id])
treatments_dt <- final_dt[, .(t, x, y, treatment), keyby = animal_id]

Then here's the function that does the merging for users

get_animals <- function(animal_names) {
     output <- animals_dt[animal_id %in% animal_names] # Getting desired animals
     output <- treatments_dt[output] # merging in treatment details
     return(output)
 }
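A user would then only need a single call, for example (with the animal_id values from the toy data above):

get_animals(c(1, 2))  # observations, treatment and date for animals 1 and 2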

Edited to use animal_id as the unique identifier instead of treatment. h/t Uwe

Mallick Hossain
  • If two animals have the same treatment value the code returns an error message: "Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 40 rows; more than 22 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. [...]" – Uwe Aug 03 '17 at 12:57
  • Good point. After rereading the question, I see why my code has an error in your scenario: I made the mistake of assuming that treatments were delivered at certain times. Having a unique animal_id makes a lot more sense. Thanks for pointing out that error! – Mallick Hossain Aug 03 '17 at 13:28