
I'm working in R with the arrow package. I have multiple tsv files:

sample1.tsv
sample2.tsv
sample3.tsv
sample4.tsv
sample5.tsv
...
sample50.tsv

each of the form

| id   | start| end | value|
| ---  | -----|-----|------|
| id1  | 1    | 3   |  0.2 |
| id2  | 4    | 6   |  0.5 |
| id.  | ...  | ... |  ... |
| id50 | 98   | 100 |  0.5 |

and an index file:

|  id  | start| end | 
| ---- | -----|-----|
| id1  | 1    | 3   |
| id2  | 4    | 6   |  
| id.  | ...  | ... |
| id50 | 98   | 100 |

I left join each sample onto the index file by id, start and end, building up a data table like this:

| id   | start| end | sample1 | sample2 | sample(n) |
| ---  | -----|-----|---------|---------|-----------|
| id1  | 1    | 3   |  0.2    |  0.1    |   ...     |
| id2  | 4    | 6   |  0.5    |  0.8    |   ...     |
| id.  | ...  | ... |  ...    |  ...    |   ...     |
| id50 | 98   | 100 |  0.5    |  0.6    |   ...     |

Since there are many samples, I'd like to read them in chunks (e.g. chunk_size = 5) and, once chunk_size samples have been joined, write that joined data table to disk as a parquet file.

Currently, I'm able to write each chunked data table to disk and read them back with open_dataset("datadir"). In a loop with i as the sample number:

# read and join the next sample
...
if (i %% chunk_size == 0) {
  write_parquet(joined_table, file.path("datadir", paste0("chunk", i / chunk_size, ".parquet")))
}
...
# clear the accumulated sample columns from the joined table
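
In full, the loop looks roughly like this (a simplified sketch: the index path, the sample file naming and the read_tsv_arrow/rename calls are stand-ins for how the data is actually read and named):

library(arrow)
library(dplyr)

index <- read_tsv_arrow("index.tsv")            # columns: id, start, end
sample_files <- sprintf("sample%d.tsv", 1:50)
chunk_size <- 5

joined_table <- index
for (i in seq_along(sample_files)) {
  # read sample i and rename its value column to sample<i>
  sample_i <- read_tsv_arrow(sample_files[i]) |>
    rename(!!paste0("sample", i) := value)
  joined_table <- left_join(joined_table, sample_i, by = c("id", "start", "end"))

  if (i %% chunk_size == 0) {
    # write the accumulated chunk, then reset to just the index columns
    write_parquet(joined_table, file.path("datadir", paste0("chunk", i / chunk_size, ".parquet")))
    joined_table <- index
  }
}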

However, even though arrow reports that it read as many files as were written, when I check the available columns, only the columns from the first chunk show up.

data <- arrow::open_dataset("datadir")
data
# FileSystemDataset with 10 Parquet files
# id: string
# start: int32
# end: int32
# sample1: double
# sample2: double
# sample3: double
# sample4: double
# sample5: double 

Samples 6-50 are missing. Reading the parquet files individually shows that each one contains the samples from its chunk.

data2 <- arrow::open_dataset("datadir/chunk2.parquet")
data2
# FileSystemDataset with 1 Parquet file
# id: string
# start: int32
# end: int32
# sample6: double
# sample7: double
# sample8: double
# sample9: double
# sample10: double

Are parquet files the right format for this task? I'm not sure what I'm missing to make a set of files split this way read back as a single dataset.

UPDATE: Opening multiple parquet files with different columns requires passing an explicit schema describing every column in the dataset; by default open_dataset() only inspects the first file for its schema.

my_schema <- arrow::schema(
      arrow::field("id", arrow::string()),
      arrow::field("start", arrow::int32()),
      arrow::field("end", arrow::int32()),
      arrow::field("sample1", arrow::float64()),
      arrow::field("sample2", arrow::float64()),
      arrow::field("sample3", arrow::float64()),
      arrow::field("sample4", arrow::float64()),
      arrow::field("sample5", arrow::float64()),
      arrow::field("sample6", arrow::float64()),
      arrow::field("sample7", arrow::float64()),
      arrow::field("sample8", arrow::float64()),
      arrow::field("sample9", arrow::float64()),
      arrow::field("sample10", arrow::float64())
)

open_dataset("dataset_dir", schema = my_schema)
# FileSystemDataset with 10 Parquet files
id: string
start: int32
sample1: int32
sample2: int32
sample3: int32
sample4: int32
sample5: int32
sample6: int32
sample7: int32
sample8: int32
sample9: int32
sample10: int32
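
For the real data with 50 samples, typing every field out isn't practical, so the field list can be built programmatically instead (a sketch, building the fields with lapply() and passing them to arrow::schema() through do.call()):

# key columns plus one float64 field per sample column
fields <- c(
  list(
    arrow::field("id", arrow::string()),
    arrow::field("start", arrow::int32()),
    arrow::field("end", arrow::int32())
  ),
  lapply(1:50, function(i) arrow::field(paste0("sample", i), arrow::float64()))
)
my_schema <- do.call(arrow::schema, fields)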

The only problem now is that the number of rows in the dataset returned by open_dataset() with this schema is the expected number of rows multiplied by the number of parquet files written. If the full dataset had 10 rows and were written to 2 parquet files, nrow(dataset) gives 20 instead of 10. I write the id, start and end columns into every parquet file, which may be what's causing this.

Instead of something like:

| id   | start| end | sample1 | sample2 |
| ---  | -----|-----|---------|---------|
| id1  | 1    | 3   |  0.2    |  0.5    |
| id2  | 4    | 6   |  0.5    |  0.1    |
| id.  | ...  | ... |  ...    |  ...    |
| id50 | 98   | 100 |  0.5    |  0.5    |

it looks like:

| id   | start| end | sample1 | sample2 |
| ---  | -----|-----|-------- |------   |
| id1  | 1    | 3   |  0.2    | NA      |
| id2  | 4    | 6   |  0.5    | NA      |
| id.  | ...  | ... |  ...    | NA      |
| id50 | 98   | 100 |  0.5    | NA      |
| id1  | 1    | 3   |  NA     | 0.5     |
| id2  | 4    | 6   |  NA     | 0.1     |
| id.  | ...  | ... |  NA     | ...     |
| id50 | 98   | 100 |  NA     | 0.5     |
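
A quick check on the dataset seems to confirm the stacking (a sketch, assuming the grouped summarise below is supported on the Dataset by arrow's dplyr bindings):

library(dplyr)

ds <- open_dataset("datadir", schema = my_schema)
nrow(ds)  # index rows x number of parquet files, not the expected index row count

# every (id, start, end) key appears once per parquet file
ds |>
  group_by(id, start, end) |>
  summarise(n_files = n()) |>
  collect() |>
  head()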

It's doing something closer to an rbind than a cbind on the parquet files. Is it possible to read in a set of parquet files with different schemas and have them merged column-wise?
