I have a function that creates a data.table with around 29 million rows and a user-defined number of columns based on an input sample list. It reads individual sample files, each with an index column, and joins them column-wise to a master index column to create this large data.table.
| index | sample1 | sample2 |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 3 | 4 |
| ... | ... | ... |
| 29m | 5 | 6 |
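Simplified, the current all-at-once build is roughly the following (file names, the delimited format, and the read loop are placeholders for what the real function does):

```r
library(data.table)

# Sketch of the current approach: join every sample file to the master index at once.
sample_files <- sprintf("sample%d.tsv", 1:10)        # placeholder paths

DT <- fread("master_index.tsv")                      # single column: index (~29M rows)

for (f in sample_files) {
  s <- fread(f)                                      # columns: index, <sample value>
  DT <- merge(DT, s, by = "index", all.x = TRUE)     # column-wise join on index
}
# DT now holds index + one column per sample, all in memory
```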
Since this takes up a lot of memory when done all at once, I'd like to read in a few files at a time, join them to the index, write them to disk as parquet files, and clear them from memory. From the arrow documentation, I can't tell whether this is possible, or whether partitioning is only possible when the full dataset is in memory, and only by the values in the columns rather than by the column names themselves. Doing something like
arrow::write_dataset(DT, path = "some_dir", partitioning = c("sample1", "sample2", ...))
gives a directory tree with a nested sub-directory for every distinct value in each of those columns, which is not what I want.
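Roughly what I have in mind (and close to what I've tried) is the sketch below; the chunk size, file names, and output directory are placeholders:

```r
library(data.table)
library(arrow)

sample_files <- sprintf("sample%d.tsv", 1:10)
chunks <- split(sample_files, ceiling(seq_along(sample_files) / 5))  # groups of 5

index_dt <- fread("master_index.tsv")                # single column: index
dir.create("dataset_dir", showWarnings = FALSE)

for (i in seq_along(chunks)) {
  DT <- copy(index_dt)
  for (f in chunks[[i]]) {
    s <- fread(f)
    DT <- merge(DT, s, by = "index", all.x = TRUE)   # join this chunk's samples
  }
  write_parquet(DT, sprintf("dataset_dir/part-%d.parquet", i))
  rm(DT, s); gc()                                    # clear the chunk from memory
}
```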
I have tried an approach that writes parquet files to a directory, but reading them back in is proving hard, since I can't figure out how to read multiple parquet files and join them column-wise. I've asked another question about that here: How to write multiple arrow/parquet files in chunks while reading in large quantities of data so that all written files are one dataset?
If I have 10 sample files and want to chunk them in groups of 5, I'm expecting to end up with a directory of 2 parquet files, each containing the index column and 5 sample columns.
part-1.parquet: index, sample1, sample2, sample3, sample4, sample5
part-2.parquet: index, sample6, sample7, sample8, sample9, sample10
I can also have three files - one index parquet, one parquet with the first 5 samples, and another with the last 5 samples - if column duplication is not good practice.
index.parquet: index
part-1.parquet: sample1, sample2, sample3, sample4, sample5
part-2.parquet: sample6, sample7, sample8, sample9, sample10
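In code terms, that second layout would be written something like this (a sketch with toy stand-in tables, since the real chunks are ~29M rows and have 5 sample columns each):

```r
library(arrow)
library(data.table)

# Toy data standing in for the real chunks (only two sample columns each for brevity)
index_dt  <- data.table(index = 1:3)
chunk1_dt <- data.table(index = 1:3, sample1 = c(1, 3, 5), sample2 = c(2, 4, 6))
chunk2_dt <- data.table(index = 1:3, sample6 = c(7, 9, 11), sample7 = c(8, 10, 12))

dir.create("dataset_dir", showWarnings = FALSE)
write_parquet(index_dt, "dataset_dir/index.parquet")                # index stored once
write_parquet(chunk1_dt[, !"index"], "dataset_dir/part-1.parquet")  # sample columns only
write_parquet(chunk2_dt[, !"index"], "dataset_dir/part-2.parquet")
```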