
Because my data is massive, I am working with the `data.table` and `arrow` packages. I learned how to use the `arrow` package from this post: R (data.table): computing mean after join in most efficient way

I need to do some operations before saving my data as parquet. More specifically, I need to unlist one of the columns (`colA`) within each group defined by the remaining columns (`colB`-`colC`-`colD`), and then save the result as parquet with the `write_dataset()` function from the `arrow` package. The problem is that my code breaks when I try to do the unlisting (Step 3 in the code below).

library(data.table)
library(arrow)

# Step 1: Import *file 1* with colA, colB, colC, colE and collapse colA into a list by colB-colC-colE
df <- fread("DT1_csv_file")[,
                            .(colA = list(colA)),
                            by = .(colB, colC, colE)
                            ]

# Step 2: Add colD from *file 2*, by colB-colC-colE 
df <- df[fread("DT2_csv_file"), 
         on=.(colB, colC, colE), 
         allow.cartesian = TRUE][,
                                 colE:=NULL] # drop colE

# Step 3: Unlist colA by colB-colC-colD and drop duplicates
df <- unique(
  df[, .(colA = unlist(colA)), 
     by = .(colB, colC, colD)]
)
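As discussed in the comments below, Step 3 fails with `negative length vectors are not allowed` because the unlisted result hits an internal vector-length limit when built in a single call. One workaround is to unlist in chunks of groups rather than all at once. A sketch of that idea, where `.chunk` is a hypothetical helper column and `n` (groups per chunk) is a value you would tune to your memory:

```r
library(data.table)

n <- 1000L  # hypothetical number of groups per chunk; tune to available memory
df[, .chunk := .GRP %/% n, by = .(colB, colC, colD)]
df <- unique(rbindlist(
  lapply(split(df, by = ".chunk"), function(chunk)
    unique(chunk[, .(colA = unlist(colA)), by = .(colB, colC, colD)]))
))
```

The `.chunk` column is dropped automatically because the inner aggregation only keeps `colA`, `colB`, `colC`, and `colD`.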

## Step 4: df is a massive data.table object. Below I save it as parquet file before dropping it from R environment
write_dataset(df, "df", format = "parquet")
rm(df)


## Step 5: Open the parquet dataset and compute stats
library(dplyr)
df <- arrow::open_dataset("df") # same path used in write_dataset() above
output <- df %>%
  left_join(X_by_values_of_D, # another object with value X by each value of colD
            by = "colD") %>%
  group_by(colA, colB, colC) %>%
  summarize(mean_X = mean(X)) %>%
  collect()

Would anyone have any advice on how to proceed? I could, for instance, write a loop where, for every combination of colB-colC-colD, I unlist colA and save the result as a separate parquet file.
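The per-combination loop described above could be sketched as follows. This is a sketch only, not tested at scale; the `df_parts` directory and the part file names are assumptions, and it assumes each `colB`-`colC`-`colD` combination fits in memory on its own:

```r
library(data.table)
library(arrow)

# Sketch: process one colB-colC-colD combination at a time and write each
# result as its own parquet file under a common directory.
dir.create("df_parts", showWarnings = FALSE)
combos <- unique(df[, .(colB, colC, colD)])
for (i in seq_len(nrow(combos))) {
  sub <- df[combos[i], on = .(colB, colC, colD)]          # rows for this combination
  sub <- unique(sub[, .(colA = unlist(colA)), by = .(colB, colC, colD)])
  write_parquet(sub, file.path("df_parts", sprintf("part-%06d.parquet", i)))
}
```

`arrow::open_dataset("df_parts")` would then treat all the part files as a single dataset for Step 5.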

PaulaSpinola
  • Just edited the code. In the last part of the code I want to compute the average of variable `X` among each value of `colD` by each combination of `colA`-`colB`-`colC` – PaulaSpinola Feb 06 '23 at 01:57
  • I just realised that it would be a lot more efficient if I could do all operations above (Step1-Step5) under `arrow` package! Would this be possible? – PaulaSpinola Feb 06 '23 at 02:08
  • about `"X_by_values_of_D"`: I have edited my code to unquote `X_by_values_of_D`, as this is an R object which I am joining to `df`. Please let me know if it is still not clear. Unfortunately, I need to keep my code solely in R - would you have any suggestions for how to make this code feasible? At the moment, it is breaking at Step 3 - before I save the object as a parquet file. – PaulaSpinola Feb 06 '23 at 02:16
  • when you say the code breaks, is it because of the vector limit error? – akrun Feb 06 '23 at 02:17
  • I get the error message `negative length vectors are not allowed`, which happens because the result exceeds the maximum number of rows – PaulaSpinola Feb 06 '23 at 02:20
  • Have you tried splitting the data into chunks before you do the `unlist`? I.e., suppose you specify `n <- 25000` rows for each chunk: `dfnew <- unique(rbindlist(lapply(split(df, as.integer(gl(nrow(df), n, nrow(df)))), function(x) unique(x[, .(colA = unlist(colA)), .(colB, colC, colD)]))))` – akrun Feb 06 '23 at 03:00
  • This code is actually within a broader loop. Because I have different number of rows for each combination of `colB`-`colC`-`colD` across different iterations of the broader loop, I reckon it is safer to split the code into different combinations of `colB`-`colC`-`colD` instead of a fixed number of rows. – PaulaSpinola Feb 06 '23 at 03:06
  • Yes, that would be safer. My comment was just a way to bypass the error you got – akrun Feb 06 '23 at 03:11
  • Thanks @akrun, would you know how to split the code into different combinations of `colB`-`colC`-`colD`? – PaulaSpinola Feb 06 '23 at 03:17
  • You can use `df[, grp := .GRP %/% n, .(colB, colC, colD)]; unique(rbindlist(lapply(split(df, df$grp), function(x) unique(x[, .(colA = unlist(colA)), .(colB, colC, colD)]))))` – akrun Feb 06 '23 at 03:19
  • where `n` is the required group size (which you have to decide based on the total number of groups). – akrun Feb 06 '23 at 03:19
  • It's not clear if you have resolved your question. If not, please help us to understand by adding a small representative data sample; even though part of your problem is the size of it (which cannot/should-not be shared in the question), it's still much easier to provide actionable advice with the ability to "play" with it on our own consoles. – r2evans Feb 17 '23 at 14:54

0 Answers