Pyarrow write nested array to parquet

Question

I want to write a parquet file that has some normal columns with 1d array data and some columns that have nested structure, i.e. 2d arrays.

I have tried the following:

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

array1 = np.array([0, 1, 2], dtype=np.uint8)
array2 = np.array([[0,1,2], [3, 4, 5]], dtype=np.uint8).T

t1 = pa.uint8()
t2 = pa.list_(pa.uint8())

fields = [
    pa.field('a1', t1),
    pa.field('a2', t2)
]

myschema = pa.schema(fields)

mytable = pa.Table.from_arrays([
    pa.array(array1, type=t1),
    pa.array([array2[:,0], array2[:,1]], type=t2)],
    schema=myschema)

pq.write_table(mytable, 'example.parquet')

The table creation works as expected. The last line is where the issue lies: it causes the Python interpreter to crash.

On Windows with Python 3.6.4 64-bit (EDIT: using pyarrow 0.11.1), I get the exit code:

Process finished with exit code -1073741819 (0xC0000005)

I have also tried in Windows Subsystem for Linux (WSL) using a separate install of Python 3.6.5 64-bit (EDIT: using pyarrow 0.12.1), and I get:

Segmentation fault (core dumped)

I have seen this post suggesting to reinstall Python, but since I've tried with two different installs so far I don't think this will help.

I can't see anything in the PyArrow docs to suggest that writing nested arrays to Parquet doesn't work. I know there are issues with this in fastparquet.

This seems like an issue in `pyarrow`. Can you check that this still persists with the latest 0.12.1 release? If so, please report an issue at https://issues.apache.org/jira/projects/ARROW/issues — Uwe L. Korn, Mar 04 '19 at 20:03
Have edited and added pyarrow versions. The test in WSL was with 0.12.1 — S.B.G, Mar 05 '19 at 10:50
Issue created https://issues.apache.org/jira/browse/ARROW-4774 — S.B.G, Mar 05 '19 at 11:19
