I'm working in R with the arrow package. I have multiple TSV files:
sample1.tsv
sample2.tsv
sample3.tsv
sample4.tsv
sample5.tsv
...
sample50.tsv
each of the form
| id | start| end | value|
| --- | -----|-----|------|
| id1 | 1 | 3 | 0.2 |
| id2 | 4 | 6 | 0.5 |
| id. | ... | ... | ... |
| id50 | 98 | 100 | 0.5 |
and an index file:
| id | start| end |
| ---- | -----|-----|
| id1 | 1 | 3 |
| id2 | 4 | 6 |
| id. | ... | ... |
| id50 | 98 | 100 |
I use the index file to left join each sample on id, start, and end (roughly as sketched after the table), producing a data table like this:
| id | start| end | sample1 | sample2 | sample(n) |
| --- | -----|-----|---------|---------|-----------|
| id1 | 1 | 3 | 0.2 | 0.1 | ... |
| id2 | 4 | 6 | 0.5 | 0.8 | ... |
| id. | ... | ... | ... | ... | ... |
| id50 | 98 | 100 | 0.5 | 0.6 | ... |
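For reference, the per-sample join is roughly this (a minimal sketch assuming data.table; the index file path is a placeholder):

```r
library(data.table)

index <- fread("index.tsv")      # placeholder path; columns id, start, end
smpl  <- fread("sample1.tsv")    # columns id, start, end, value

# left join the sample onto the index and name the value column after the sample
joined <- merge(index, smpl, by = c("id", "start", "end"), all.x = TRUE)
setnames(joined, "value", "sample1")
```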
With multiple samples, I'd like to read them in chunks (e.g. chunk_size = 5), and once chunk_size samples have been read and joined, write that joined data table to disk as a parquet file. Currently, I'm able to write each chunked data table to disk, and I read them back with open_dataset(datadir). In a loop with i as the sample number:
# read and join
...
if (i %% chunk_size == 0) {
  write_parquet(joined_table, file.path("datadir", paste0("chunk", i / chunk_size, ".parquet")))
}
...
# clear the data table of samples
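In full, the loop is along these lines (a sketch of my approach, assuming data.table for the joins; the index path and sample paths are placeholders):

```r
library(arrow)
library(data.table)

datadir <- "datadir"
chunk_size <- 5
sample_files <- sprintf("sample%d.tsv", 1:50)   # placeholder paths

index <- fread("index.tsv")                     # placeholder path
joined_table <- copy(index)

for (i in seq_along(sample_files)) {
  smpl <- fread(sample_files[i])
  joined_table <- merge(joined_table, smpl, by = c("id", "start", "end"), all.x = TRUE)
  setnames(joined_table, "value", paste0("sample", i))

  if (i %% chunk_size == 0) {
    write_parquet(joined_table, file.path(datadir, paste0("chunk", i / chunk_size, ".parquet")))
    joined_table <- copy(index)   # clear the accumulated sample columns
  }
}
```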
However, even though the arrow package reports that it opened as many files as were written, only the columns from the first chunk show up when I check the available columns.
data <- arrow::open_dataset("datadir")
data
# FileSystemDataset with 10 Parquet files
# id: string
# start: int32
# end: int32
# sample1: double
# sample2: double
# sample3: double
# sample4: double
# sample5: double
Samples 6-50 are missing. Reading the parquet files individually shows that each contains the samples from its chunk.
data2 <- arrow::open_dataset("datadir/chunk2.parquet")
data2
# FileSystemDataset with 1 Parquet file
# id: string
# start: int32
# end: int32
# sample6: double
# sample7: double
# sample8: double
# sample9: double
# sample10: double
Are parquet files the right format for this task? I'm not sure what I'm missing to make a split set of files behave as a single dataset when read back in.
UPDATE: Opening multiple parquet files with arrow requires a schema that describes the columns in the dataset.
my_schema <- arrow::schema(
  arrow::field("id", arrow::string()),
  arrow::field("start", arrow::int32()),
  arrow::field("sample1", arrow::int32()),
  arrow::field("sample2", arrow::int32()),
  arrow::field("sample3", arrow::int32()),
  arrow::field("sample4", arrow::int32()),
  arrow::field("sample5", arrow::int32()),
  arrow::field("sample6", arrow::int32()),
  arrow::field("sample7", arrow::int32()),
  arrow::field("sample8", arrow::int32()),
  arrow::field("sample9", arrow::int32()),
  arrow::field("sample10", arrow::int32())
)
open_dataset("datadir", schema = my_schema)
# FileSystemDataset with 10 Parquet files
# id: string
# start: int32
# sample1: int32
# sample2: int32
# sample3: int32
# sample4: int32
# sample5: int32
# sample6: int32
# sample7: int32
# sample8: int32
# sample9: int32
# sample10: int32
The only problem now is that the number of rows in the dataset returned by open_dataset with this schema is the expected number of rows multiplied by the number of parquet files written. If the full dataset had 10 rows and were written to 2 parquet files, nrow(dataset) would give 20 instead of 10. The key columns are written to every parquet file, which may be causing this.
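As a quick check (illustrative numbers, matching the 10-row / 2-file example above):

```r
nrow(arrow::open_dataset("datadir", schema = my_schema))
# 20, where 10 was expected
```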
Instead of something like:
| id | start| end | sample1 | sample2 |
| --- | -----|-----|------ |---------|
| id1 | 1 | 3 | 0.2 | 0.5 |
| id2 | 4 | 6 | 0.5 | 0.1 |
| id. | ... | ... | ... | ... |
| id50 | 98 | 100 | 0.5 | 0.5 |
it looks like:
| id | start| end | sample1 | sample2 |
| --- | -----|-----|-------- |------ |
| id1 | 1 | 3 | 0.2 | NA |
| id2 | 4 | 6 | 0.5 | NA |
| id. | ... | ... | ... | NA |
| id50 | 98 | 100 | 0.5 | NA |
| id1 | 1 | 3 | NA | 0.5 |
| id2 | 4 | 6 | NA | 0.1 |
| id. | ... | ... | NA | ... |
| id50 | 98 | 100 | NA | 0.5 |
It's doing something closer to an rbind than a cbind on the parquet files. Is it possible to read in a set of parquet files with different schemas that are merged column-wise?