
I have two massive datasets. I created a simple example (below) to replicate the structure of my data: DT_ade has variables a, d, e, while DT_abce has variables a, b, c, e. I would like to compute the average of d by a-b-c. In the code below, I merged the two data.tables on a and e. Because I am not interested in variable e, I just selected the remaining variables after the join (a, b, c, d). I then computed the average of d by a-b-c.

library(data.table)


DT_ade = data.table(a = c("a1", "a1", "a2", "a3", "a3"),
                    d = c(10, 20, 30, 40, 80) ,
                    e = c("e1", "e2", "e2", "e2", "e3"))

DT_abce = data.table(a = c("a1", "a1", "a1", "a1", "a2", "a3", "a3"),
                     b = c("b1", "b1", "b2", "b2", "b2", "b2", "b2"),
                     c = c("c1", "c1", "c2", "c2", "c2", "c2", "c2"),
                     e = c("e1", "e2", "e1", "e2", "e2", "e2", "e3"))
  

DT_ade[
  DT_abce, on=.(a, e), .(a, b, c, d)
  ][, .(mean_d = mean(d, na.rm=TRUE)),
    .(a, b, c)]

The code works in this simple example: it gives the following output:

    a  b  c mean_d
1: a1 b1 c1     15
2: a1 b2 c2     15
3: a2 b2 c2     30
4: a3 b2 c2     60

However, when applying this code to my real data, I get the following error message due to the size of my data:

Error in gforce(thisEnv, jsub, o__, f__, len__, irows) : 
  Internal error: Failed to allocate counts or TMP when assigning g in gforce

I am wondering if there is any less demanding way to do this. For instance, could I compute the mean of d by a-b-c without generating the full a-b-c-d table as an intermediate step?

PaulaSpinola
  • For bigger than memory datasets, you could use [arrow](https://arrow.apache.org/docs/r/#using-dplyr-with-arrow), see example [here](https://stackoverflow.com/a/72736965/13513328) – Waldi Dec 19 '22 at 05:24
  • Thank you @Waldi! My data is in csv. I tried changing the format as you did with the code below but it is taking forever `df <- fread("csv/df.csv"); write_feather(df, "csv/df.feather")` – PaulaSpinola Dec 19 '22 at 10:55

1 Answer


I'll expand on Waldi's recommendation for arrow with an example.

Up front: the benefits of arrow show up when you are dealing with large amounts of data, because it retrieves and processes the data lazily. While I use arrow::arrow_table here, that is only because it is convenient for this example; in your case, the data should be parquet files on the filesystem (preferably local).

That is, ideally you would create these objects as:

ds_ade <- arrow::open_dataset("path/to/ade.parquet")
ds_abce <- arrow::open_dataset("path/to/abce.parquet")

But for now, I'll use:

ds_ade <- arrow::arrow_table(DT_ade)
ds_abce <- arrow::arrow_table(DT_abce)

Each of these is an arrow object: no data has been pulled into R yet, you only see what it looks like:

ds_ade
# Table
# 5 rows x 3 columns
# $a <string>
# $d <double>
# $e <string>

From here, a relatively simple dplyr-pipe can be used that doesn't pull data or do the calculations yet:

library(dplyr)
left_join(ds_ade, ds_abce, by = c("a", "e")) %>%
  group_by(a, b, c) %>%
  summarize(mean_d = mean(d, na.rm=TRUE)) %>%
  ungroup()
# Table (query)
# a: string
# b: string
# c: string
# mean_d: double
# See $.data for the source Arrow object

We just add %>% collect() to that, and the data materializes (i.e., is calculated and then pulled into R):

left_join(ds_ade, ds_abce, by = c("a", "e")) %>%
  group_by(a, b, c) %>%
  summarize(mean_d = mean(d, na.rm=TRUE)) %>%
  ungroup() %>%
  collect()
# # A tibble: 4 × 4
#   a     b     c     mean_d
#   <chr> <chr> <chr>  <dbl>
# 1 a1    b2    c2        15
# 2 a1    b1    c1        15
# 3 a2    b2    c2        30
# 4 a3    b2    c2        60

Don't confuse "speed of data.table-vs-dplyr" with what is going on here: you're losing no speed, since the calculations are being done in arrow, not in data.table or dplyr. It is not until the collect() that the calculations are actually done (in arrow) and R first sees the data. If you then choose to as.data.table this data (for the rest of your processing), that's fine.
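For instance, a minimal sketch of that hand-off, reusing the pipeline above (the follow-on data.table step is just illustrative):

library(data.table)

res <- left_join(ds_ade, ds_abce, by = c("a", "e")) %>%
  group_by(a, b, c) %>%
  summarize(mean_d = mean(d, na.rm = TRUE)) %>%
  ungroup() %>%
  collect() %>%       # arrow computes the result, then hands it to R
  as.data.table()     # continue with data.table from here

res[order(a, b, c)]   # regular data.table syntax from this point on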

The "pain" of this approach will be to put your data into parquet files in the first place. Since your data is fairly large, it seems likely that when it is loaded into R, you are in a fragile situation: if your windows computer decides it must update now and reboot, or if you lose power (and UPS/batteries fail), or something else happens, you're sunk, and need to create or load all of the data again.

In my experience, depending on your real data, you might want to load your CSV into R first anyway, do some simple processing (i.e., convert to POSIXt, set factors and levels, etc., but no reshaping/aggregation), and then save to parquet. When saved, the underlying data class and attributes are preserved. In fact, since you are saving a data.table, you can very quickly (inexpensively) run ds_ade %>% head() %>% collect() and see that (1) it is nearly instantaneous, and (2) it is a data.table. If you have other classed objects (POSIXt, factor), then they will retain those classes. Filtering is efficient, especially when the variables are numeric. (But I'm not an arrow maintainer/designer/engineer, so it's always possible I have misunderstood and therefore misrepresented some portions of the laziness here.)
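As a rough sketch of that CSV-to-parquet workflow (the file paths and the cleaning step are placeholders, not from the original post):

library(data.table)
library(arrow)

# one-time conversion: read the CSV, do light cleaning, save as parquet
df_ade <- fread("csv/ade.csv")                    # hypothetical path
# ... simple processing here (convert timestamps, set factors, etc.) ...
write_parquet(df_ade, "parquet/ade.parquet")      # hypothetical path

# afterwards, open the parquet file lazily; nothing is read into memory yet
ds_ade <- open_dataset("parquet/ade.parquet")
ds_ade %>% head() %>% collect()                   # cheap sanity check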

r2evans
  • thanks a million @r2evans! This is extremely helpful! When running the `left_join` command on my data I am getting the following error message: `Error in collect(): ! Invalid: key_cmp and keys cannot be empty. Run rlang::last_error() to see where the error occurred.` – PaulaSpinola Dec 20 '22 at 01:24
  • I've never seen that error before. Perhaps you can post closer-to-real data (such as `head(.)` from each)? – r2evans Dec 20 '22 at 01:59
  • Thank you @r2evans! It has nothing to do with the `arrow` package: I was using the wrong syntax for `left_join` when the column name is different between the two datasets being joined – PaulaSpinola Dec 20 '22 at 12:36
  • Does that mean it works without memory-allocation issues? – r2evans Dec 20 '22 at 13:01
  • I solved the error message above by just using the right syntax for the `left_join` function (I was using `on` instead of `by` before). Unfortunately the code breaks when I use my real data as it is really massive. Once I run the entire code chunk (after I have called for the `parquet` files), my R session is aborted after a minute or so. – PaulaSpinola Dec 20 '22 at 18:41
  • Bummer. More thoughts: (1) Consider doing this in batches, perhaps subset 10 IDs from each and join them, then save somewhere; repeat with another 10 IDs. The right number might be 10 or 1000 or something in between, I don't know. (2) Consider a full-scale DBMS, though even with large data you're still constrained by the underlying system memory. – r2evans Dec 20 '22 at 19:04