
I'm writing a program to convert a tabular ("rownar", i.e. row-oriented) custom binary file format to Parquet using Arrow C++.

The core of the program works as follows:

ColInfo colinfo = ...; // fixed schema per file, not known at compile time
parquet::StreamWriter w = ...;

void on_row(const RowData& row) {
  for (const auto& col : colinfo) {
    w << convertCell(row, col);
  }
  w << parquet::EndRow;  // terminate the row
}

Here, on_row is called by the input file parser for each row parsed.

This works fine but is pretty slow, with StreamWriter::operator<< being the bottleneck.

Question: What's an alternative to StreamWriter that's similarly easy to use but faster?

Constraints:

  • Can't change the callback-based interface shown above.
  • Input data doesn't fit into memory.

I've looked into the reader_writer{,2}.cc examples in the Arrow repository that use the WriteBatch interface. Is that the recommended way to quickly create Parquet files? If so, what's the recommended way to size row groups? Or is there an interface that abstracts away row groups, like StreamWriter does? And what's the recommended num_values to pass per WriteBatch call?
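
For reference, the low-level interface those examples use looks roughly like this. This is a minimal sketch based on reader_writer.cc; the one-column INT64 schema, file name, and values are placeholders, not my actual data:

#include <arrow/io/file.h>
#include <parquet/api/writer.h>

// Placeholder schema: a single required INT64 column named "x".
std::shared_ptr<parquet::schema::GroupNode> MakeSchema() {
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "x", parquet::Repetition::REQUIRED, parquet::Type::INT64));
  return std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));
}

int main() {
  auto out = arrow::io::FileOutputStream::Open("out.parquet").ValueOrDie();
  auto writer = parquet::ParquetFileWriter::Open(out, MakeSchema());

  parquet::RowGroupWriter* rg = writer->AppendRowGroup();
  auto* col = static_cast<parquet::Int64Writer*>(rg->NextColumn());
  int64_t values[] = {1, 2, 3};
  // Required column, so no definition/repetition levels are needed.
  col->WriteBatch(/*num_values=*/3, /*def_levels=*/nullptr, /*rep_levels=*/nullptr, values);
  rg->Close();
  writer->Close();
  return 0;
}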

Secondary question: What are some good opportunities to concurrently create the Parquet file? Can batches, chunks, columns, or row groups be written concurrently?

Jonas H.
  • If your current approach is too slow, my best guess is that you'll want to pass batches (and not individual values) to the underlying column writers. So using `WriteBatch` seems reasonable, but you will need to build up batches before submitting them (e.g. don't call `WriteBatch` with batches of size 1; see the sketch below these comments). For the parameters, profiling is your friend, but batches roughly the size of the L3 cache are probably a good idea for `WriteBatch`. For row groups, the larger the better for performance generally, if you have the RAM to accumulate that much in memory. The default for the stream writer is 512MB. – Pace Jan 03 '23 at 19:07
  • Have you confirmed, via profiling, that the bottleneck is indeed in the _encoding_ and not in the _writing to disk_? If the bottleneck is the disk, then moving away from the stream writer is unlikely to help. As for parallelization, you can, in theory, parallelize column encoding; I don't know if the APIs are set up for that, though. Batches/chunks/row groups probably cannot be parallelized. Not much has been done to parallelize the write path, as the disk is often the bottleneck. Though there is probably some opportunity here, especially on a parallel filesystem like S3. – Pace Jan 03 '23 at 19:10
  • The trick is that writers (unlike readers) tend to be very much set up to write in sequence. E.g. you need to know what offset to start writing the second batch at, but you won't know that until you've fully encoded the first batch. With reading you don't have this problem: you simply read all the offsets in one pass and then the batches you want (potentially in parallel) in a second pass. It's not impossible to interleave some write/encode work, but I don't think it's done today. – Pace Jan 03 '23 at 19:11
  • Thanks a lot! Speed is 70 MB/s and my MacBook should be able to do around 2 GB/s. I will double-check with an in-memory destination, though. If you post your comments as an answer, I can mark it as the accepted answer. – Jonas H. Jan 03 '23 at 21:16
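
A rough sketch of what Pace suggests above: buffer the converted values per column inside on_row and hand them to WriteBatch in larger chunks, closing a buffered row group every kRowsPerGroup rows. This is untested and assumes, for brevity, that every column is a required INT64 and that ColInfo is indexable; RowData, ColInfo, and convertCell are from the question, and kBatchSize/kRowsPerGroup are illustrative knobs to tune by profiling:

#include <parquet/api/writer.h>

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative knobs -- tune via profiling.
constexpr size_t kBatchSize = 64 * 1024;      // values buffered per column before WriteBatch
constexpr int64_t kRowsPerGroup = 1'000'000;  // rows per row group

struct BatchingWriter {
  std::shared_ptr<parquet::ParquetFileWriter> file_writer;
  ColInfo colinfo;
  parquet::RowGroupWriter* rg_writer = nullptr;
  std::vector<std::vector<int64_t>> buffers;  // one buffer per column
  int64_t rows_in_group = 0;

  void on_row(const RowData& row) {
    if (rg_writer == nullptr) {
      rg_writer = file_writer->AppendBufferedRowGroup();  // columns can be written interleaved
      buffers.assign(colinfo.size(), {});
    }
    for (size_t i = 0; i < colinfo.size(); ++i) {
      buffers[i].push_back(convertCell(row, colinfo[i]));
      if (buffers[i].size() >= kBatchSize) Flush(i);
    }
    if (++rows_in_group >= kRowsPerGroup) CloseRowGroup();
  }

  void Flush(size_t i) {
    auto* col = static_cast<parquet::Int64Writer*>(rg_writer->column(static_cast<int>(i)));
    // Required (non-nullable) columns need no definition/repetition levels.
    col->WriteBatch(static_cast<int64_t>(buffers[i].size()),
                    /*def_levels=*/nullptr, /*rep_levels=*/nullptr, buffers[i].data());
    buffers[i].clear();
  }

  void CloseRowGroup() {
    for (size_t i = 0; i < buffers.size(); ++i) Flush(i);
    rg_writer->Close();
    rg_writer = nullptr;
    rows_in_group = 0;
  }
};

At the end of the input you would flush any partial row group (CloseRowGroup()) and then call file_writer->Close(). AppendBufferedRowGroup keeps the encoded pages of the current row group in memory until the group is closed, so memory use stays bounded to roughly one row group even though the input doesn't fit in memory.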

0 Answers