Because my data is massive, I am working with the data.table and arrow packages. I learned how to use the arrow package from this post: R (data.table): computing mean after join in most efficient way.
I need to do some operations before saving my data as parquet. More specifically, I need to unlist one of the column variables (colA) by the remaining columns (colB-colC-colD) and then save the result as parquet with the write_dataset command from the arrow package. The problem is that my code breaks when I try to do the unlisting (Step 3 in the code below).
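To make the operation concrete, here is a toy illustration of the unlisting I have in mind (the column values below are made up):

library(data.table)

# Toy table where colA is a list-column: each row holds a vector
toy <- data.table(
  colB = c("b1", "b1"),
  colC = c("c1", "c1"),
  colD = c("d1", "d2"),
  colA = list(1:3, 2:4)
)
# Unlisting colA by colB-colC-colD expands each vector into one row per element
toy[, .(colA = unlist(colA)), by = .(colB, colC, colD)]

My actual pipeline is below.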
library(data.table)
library(arrow)

# Step 1: Import *file 1* with colA, colB, colC, colE and list colA by colB-colC-colE
df <- fread("DT1_csv_file")[
  , .(colA = list(colA)),
  by = .(colB, colC, colE)
]
# Step 2: Add colD from *file 2* by joining on colB-colC-colE
df <- df[
  fread("DT2_csv_file"),
  on = .(colB, colC, colE),
  allow.cartesian = TRUE
][, colE := NULL] # drop colE, which is no longer needed
# Step 3: Unlist colA by colB-colC-colD and drop duplicates
# (this is the step where my code breaks)
df <- unique(
  df[, .(colA = unlist(colA)), by = .(colB, colC, colD)]
)
# Step 4: df is a massive data.table. Save it as a parquet file before dropping it from the R environment
write_dataset(df, "df", format = "parquet")
rm(df)
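As an aside, write_dataset can also partition the output by a column. A minimal sketch, assuming colD is a sensible partition key (my assumption, not something the pipeline requires):

# Alternative to the write in Step 4: one parquet subfolder per value of colD,
# so Step 5 only has to scan the files for the groups it touches
write_dataset(df, "df", format = "parquet", partitioning = "colD")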
# Step 5: Open the parquet dataset to compute stats
library(dplyr)

df <- arrow::open_dataset("df")
output <- df %>%
  left_join(X_by_values_of_D, # another object with value X for each value of colD
            by = "colD") %>%
  group_by(colA, colB, colC) %>%
  summarize(mean_X = mean(X)) %>%
  collect()
Would anyone have any advice on how to proceed? I could, for instance, write a loop where, for every combination of colB-colC-colD, I unlist colA and save the result as a separate parquet file, as in the sketch below.
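A minimal sketch of that loop, assuming df is the result of Step 2 (so colA is still a list-column) and using a hypothetical output folder df_parquet, which Step 5 could then open with arrow::open_dataset("df_parquet"):

library(data.table)
library(arrow)

# Every colB-colC-colD combination present in the joined table
ids <- unique(df[, .(colB, colC, colD)])
dir.create("df_parquet", showWarnings = FALSE)

for (i in seq_len(nrow(ids))) {
  # Restrict df to one combination, unlist colA there, and drop duplicates
  chunk <- df[ids[i], on = .(colB, colC, colD)]
  chunk <- unique(chunk[, .(colA = unlist(colA)), by = .(colB, colC, colD)])
  write_parquet(chunk, file.path("df_parquet", paste0("part-", i, ".parquet")))
}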