Is it expected that multiple IPC stream readers can concurrently tail a single stream writer publishing from another process? The description of "IPC streams" leads me to think yes, but I cannot find any positive confirmation in the docs, and I don't see anything obvious in the source code aside from a std::mutex protecting concurrent writes in the C++ writer process.
Asking because I have a C++ producer (Arrow 10.0.0) that writes the data using arrow::ipc::MakeStreamWriter over an arrow::io::FileOutputStream, and a Python consumer (pyarrow 12) that reads the stream with pa.ipc.open_stream(f). Most of the time the data flows through without issue, but occasionally the reader sees corrupt record batches that yield errors like the one below. Re-reading the stream yields the correct data, which leads me to think this is a concurrency race condition.
_logger.exception(f'error writing batch {table_from_batches.to_pandas()}')
  File "pyarrow/array.pxi", line 837, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 2448, in pyarrow.lib.RecordBatch._to_pandas
  File "pyarrow/table.pxi", line 4114, in pyarrow.lib.Table._to_pandas
  File "../.venv/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "../.venv/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1168, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
  File "pyarrow/table.pxi", line 2771, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 127, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: Index 80 out of bounds
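
For reference, the consumer side looks roughly like the sketch below. The file path and the per-batch handling are simplified placeholders rather than the actual application code:

    import pyarrow as pa

    # Simplified sketch of the Python consumer; the path and per-batch handling
    # below are placeholders, not the real application code.
    with open('/path/to/stream', 'rb') as f:  # same file the C++ writer appends to
        reader = pa.ipc.open_stream(f)
        for batch in reader:
            # Batches are turned into a table and converted for downstream use;
            # the ArrowIndexError in the traceback above surfaces during this
            # kind of to_pandas() call.
            table_from_batches = pa.Table.from_batches([batch], schema=reader.schema)
            df = table_from_batches.to_pandas()
            # ... application-specific processing of df ...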