I want to hold 2D arrays of different lengths in an AwkwardArray, save them as Parquet, and later access them again.
The problem is that, after loading from Parquet, the data comes back as a BitMaskedArray
and access is noticeably slower, as the following code demonstrates:
import numpy as np
import awkward as awk
# big to feel performance (imitating big audio file); 2D
np_arr0 = np.arange(20000000, dtype=np.float32).reshape(2, -1)
print(np_arr0.shape)
# (2, 10000000)
# different size
np_arr1 = np.arange(20000000, 36000000, dtype=np.float32).reshape(2, -1)
print(np_arr1.shape)
# (2, 8000000)
# slow; turn into AwkwardArray
awk_arr = awk.fromiter([np_arr0, np_arr1])
# fast; returns np.ndarray
awk_arr[0][0]
# store and load from parquet
awk.toparquet("sample.parquet", awk_arr)
pq_array = awk.fromparquet("sample.parquet")
# noticeably slower; returns a BitMaskedArray
pq_array[0][0]
If we inspect the layout of the returned object, we see:
pq_array[0][0].layout
# layout
# [ ()] BitMaskedArray(mask=layout[0], content=layout[1], maskedwhen=False, lsborder=True)
# [ 0] ndarray(shape=1250000, dtype=dtype('uint8'))
# [ 1] ndarray(shape=10000000, dtype=dtype('float32'))
# trying to access only the float32 array (layout position [1])
pq_array[0][0][1]
# expected
# array([0.000000e+00, 1.000000e+00, 2.000000e+00, ..., 9.999997e+06, 9.999998e+06, 9.999999e+06], dtype=float32)
# reality
# 1.0
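A likely reading of this result: the [0] and [1] in the layout printout label the buffers inside the layout, while [1] applied to the array itself is an ordinary element index, so it returns the second value. The sketch below illustrates that distinction with NumPy only, using numpy.ma as a stand-in for a masked array. The variable names are made up for the example and are not part of the Awkward API.

```python
import numpy as np

# Stand-in for the masked array: a float32 content buffer plus a mask
# in which nothing is masked (all entries valid).
content = np.arange(5, dtype=np.float32)
mask = np.zeros(5, dtype=bool)
masked = np.ma.masked_array(content, mask=mask)

# Integer indexing selects an element, not a buffer.
element = masked[1]
print(element)

# The underlying buffer is reached through an attribute instead.
whole = masked.data
print(type(whole), whole.shape)
```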
Question
How can I load AwkwardArray from Parquet and quickly access the numpy values?
Info from README (GitHub)
awkward.fromparquet is lazy-loading the Parquet file.
Good, that will help when doing e.g. pq_array[0][0][:1000]
The next layer of new structure is that the jagged array is bit-masked. Even though none of the values are nullable, this is an artifact of the way Parquet formats columnar data.
I guess there is no way around this. However, is this the reason why loading is rather slow? Can I still get the data as a numpy.ndarray
by accessing the content directly (without the bit mask)?
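For reference, the sizes in the layout are consistent with one validity bit per value: 10000000 float32 values need 1250000 mask bytes. The following NumPy-only sketch shows how such a bit mask could be packed and unpacked, assuming that lsborder=True means least-significant-bit-first and that maskedwhen=False means a set bit marks a valid entry. It does not use Awkward itself, and the names are illustrative.

```python
import numpy as np

# One bit per value, rounded up to whole bytes.
n_values = 10_000_000
n_mask_bytes = (n_values + 7) // 8
print(n_mask_bytes)

# Small example: True marks a valid entry, False a masked one.
valid = np.array([True, True, False, True, True, True, True, True, True, False])

# Pack into uint8 bytes, least significant bit first.
packed = np.packbits(valid, bitorder="little")

# Unpack and trim the padding bits added to fill the last byte.
unpacked = np.unpackbits(packed, bitorder="little")[: len(valid)].astype(bool)
print(packed.dtype, len(packed), unpacked)
```

Unpacking a mask like this for every access costs extra work on top of reading the float32 buffer, which may contribute to the slowdown, though the lazy loading of the file is another plausible factor.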
Additional attempt
Loading it with Arrow, then Awkward:
import pyarrow as pa
import pyarrow.parquet as pq
# Parquet as Arrow
pa_array = pq.read_table("sample.parquet")
# returns table instead of JaggedArray
awk.fromarrow(pa_array)
# <Table [<Row 0> <Row 1>] at 0x7fd92c83aa90>