I want to write a Parquet file that has some normal columns with 1D array data and some columns with nested structure, i.e. 2D arrays.
I have tried the following:
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
array1 = np.array([0, 1, 2], dtype=np.uint8)
array2 = np.array([[0,1,2], [3, 4, 5]], dtype=np.uint8).T
t1 = pa.uint8()
t2 = pa.list_(pa.uint8())
fields = [
    pa.field('a1', t1),
    pa.field('a2', t2),
]
myschema = pa.schema(fields)
mytable = pa.Table.from_arrays(
    [
        pa.array(array1, type=t1),
        pa.array([array2[:, 0], array2[:, 1]], type=t2),
    ],
    schema=myschema,
)
pq.write_table(mytable, 'example.parquet')
The table creation works as expected. The last line is where the issue lies: it causes the Python interpreter to crash.
On Windows with Python 3.6.4 64-bit (EDIT: using pyarrow 0.11.1), I get the exit code:
Process finished with exit code -1073741819 (0xC0000005)
I have also tried in the Windows Subsystem for Linux (WSL) using a separate install of Python 3.6.5 64-bit (EDIT: using pyarrow 0.12.1), and I get:
Segmentation fault (core dumped)
I have seen this post suggesting reinstalling Python, but since I've already tried two different installs, I don't think this will help.
I can't see anything in the PyArrow docs to suggest that writing nested arrays to Parquet doesn't work. I know there are issues with this in fastparquet.