I'm writing a program that converts a tabular, row-oriented ("rownar") custom binary file format to Parquet using Arrow C++.
The core of the program works as follows:
```cpp
ColInfo colinfo = ...;           // fixed schema per file, not known at compile time
parquet::StreamWriter w = ...;

void on_row(const RowData& row) {
  for (const auto& col : colinfo) {
    w << convertCell(row, col);  // convert and append one cell
  }
  w << parquet::EndRow;          // finish the row
}
```
Here, `on_row` is called by the input file parser for each parsed row. This works fine but is pretty slow, with `StreamWriter::operator<<` being the bottleneck.
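For reference, the writer is set up roughly like this (a sketch only: error handling is trimmed, and `schema` stands for a `parquet::schema::GroupNode` that I build from `colinfo` at runtime):

```cpp
#include <arrow/io/file.h>
#include <parquet/exception.h>
#include <parquet/stream_writer.h>

std::shared_ptr<arrow::io::FileOutputStream> outfile;
PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("out.parquet"));

// 'schema' is a std::shared_ptr<parquet::schema::GroupNode> derived from colinfo.
parquet::StreamWriter w{
    parquet::ParquetFileWriter::Open(outfile, schema, parquet::default_writer_properties())};
```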
Question: What's an alternative to `StreamWriter` that's similarly easy to use but faster?
Constraints:
- Can't change the callback-based interface shown above.
- Input data doesn't fit into memory.
I've looked into the reader_writer{,2}.cc examples in the Arrow repository, which use the `WriteBatch` interface. Is that the recommended way to create Parquet files quickly? If so, what's the recommended way to size row groups? Or is there an interface that abstracts away row groups, as `StreamWriter` does? And what's the recommended number of values (`num_values`) to pass per `WriteBatch` call?
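For concreteness, the `WriteBatch`-based variant I'm considering looks roughly like this. This is only a sketch adapted from reader_writer2.cc: it assumes all columns are non-nullable INT64, and `kRowsPerGroup`, `flush_row_group`, and the per-column buffers are placeholders I made up.

```cpp
#include <parquet/api/writer.h>
#include <vector>

constexpr size_t kRowsPerGroup = 1 << 20;  // placeholder; sizing this is part of the question

std::unique_ptr<parquet::ParquetFileWriter> file_writer = ...;  // same schema as above
std::vector<std::vector<int64_t>> buffers(colinfo.size());      // one in-memory buffer per column

void flush_row_group() {
  parquet::RowGroupWriter* rg = file_writer->AppendRowGroup();
  for (auto& buf : buffers) {
    auto* col = static_cast<parquet::Int64Writer*>(rg->NextColumn());
    // Flat schema with no nulls, so definition/repetition levels are nullptr.
    col->WriteBatch(static_cast<int64_t>(buf.size()), nullptr, nullptr, buf.data());
    buf.clear();
  }
}

void on_row(const RowData& row) {
  for (size_t i = 0; i < colinfo.size(); ++i) {
    buffers[i].push_back(convertCell(row, colinfo[i]));
  }
  if (buffers[0].size() == kRowsPerGroup) {
    flush_row_group();  // close the current row group and start a new one
  }
}
```

In particular, I'm unsure how to pick `kRowsPerGroup` and whether a single `WriteBatch` call per column per row group is a reasonable pattern, or whether values should be fed to `WriteBatch` in smaller chunks.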
Secondary question: What are some good opportunities to create the Parquet file concurrently? Can batches, chunks, columns, or row groups be written concurrently?