
A common use case in machine/deep learning code that works on images and audio is loading and manipulating large datasets of images or audio segments. Almost always, each entry in such a dataset consists of an image/audio segment plus metadata (e.g. class label, train/test membership, etc.).

For instance, in my specific use case of speech recognition, datasets are almost always composed of entries with properties such as:

  • Speaker ID (string)
  • Transcript (string)
  • Test data (bool)
  • Wav data (numpy array)
  • Dataset name (string)
  • ...

What is the recommended way of representing such a dataset in pandas and/or dask - emphasis on the wav data (in an image dataset, this would be the image data itself)?

In pandas, with a few tricks, one can nest a numpy array inside a column, but this doesn't serialize well and also won't work with dask. This seems like an extremely common use case, but I can't find any relevant recommendations.
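
For concreteness, here is a minimal sketch of the kind of trick meant here, using an object-dtype column so each cell can hold an arbitrarily shaped array (the column names and shapes are made up for illustration):

import numpy as np
import pandas as pd

# each cell of an object-dtype column can hold an arbitrary ndarray
df = pd.DataFrame({
    "speaker_id": ["spk0", "spk1"],
    "is_test": [False, True],
})
df["wav"] = pd.Series(
    [np.random.randn(16000).astype("float32"),
     np.random.randn(24000).astype("float32")],  # entries of different lengths
    index=df.index,
    dtype=object,
)

print(df["wav"].iloc[0].shape)  # (16000,)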

One can also serialize/deserialize these arrays to a binary format (Uber's petastorm does something like this), but this seems to miss the point of libraries such as dask and pandas, where automagic serialization is one of the core benefits.
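
For illustration, a minimal round trip of a single array through raw bytes, the kind of manual codec meant here (nothing pandas/dask/petastorm-specific is assumed):

from io import BytesIO
import numpy as np

arr = np.arange(5, dtype="float32")

buf = BytesIO()
np.save(buf, arr)            # serialize the ndarray into an in-memory buffer
blob = buf.getvalue()        # plain bytes that any columnar store can hold

restored = np.load(BytesIO(blob))   # deserialize back to an ndarray
assert np.array_equal(arr, restored)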

Any practical comments, or suggestions for different methodologies are most welcome.

stav
    With `HDF5`/`h5py` you can save the array as a `dataset` and the other items as attributes. Or put several datasets in a group. `pandas` uses `pytables` to store dataframes in `HDF5`. – hpaulj Mar 23 '19 at 16:28
  • @hpaulj why don't you post this as an actual answer? It seems the correct approach to me, and there are apparently (after 5s of googling) [existing GitHub projects](https://github.com/fordDeepDSP/hdf5_scripts) that match extremely well (i.e. ready made functions that process `.wav` as `np.array()` into `.hdf5`) to what @Stav has asked for. – Asmus Apr 15 '19 at 07:46
  • Aren't HDF5 files constrained to data of constant size/dimensions? A (very) common case in image and audio datasets is entries of different lengths. – stav Apr 15 '19 at 08:29
  • While `pandas` is great, it is not the right tool for speech processing, for several reasons. One is that `pandas` is basically an in-memory database, and audio data is usually on the order of tens of gigabytes, so it is simply impractical to keep all your raw data in memory. Also, it is complicated to use pure Python for parallel computing, which is very desirable for computationally intensive tasks such as image/speech processing. This is why people prefer to exploit other technologies, only using Python on top, where it really shines (Tensorflow is a great example). – igrinis Apr 15 '19 at 12:43
  • Pandas is in-memory but dask is not, and AFAIK it is considered a very practical tool for distributed work in Python. However, it's based on the same API, so, for instance, the combination with parquet does not allow nesting multidimensional arrays in cells of dask dataframes. – stav Apr 15 '19 at 13:40
  • @Stav, if you're looking for a form of data storage that allows grouping datasets (usually arrays of arbitrary dimension/length) with associated metadata, then HDF5 is a good choice, especially once data becomes huge, since you can read `chunks` of data into memory. As far as I understood your question, you're only looking for a good I/O solution for potentially rather large data that "talks nicely" to pandas. If you have more requirements, perhaps you could emphasise this more within your question? Or give an example of the arrays you expect to store? – Asmus Apr 21 '19 at 07:23
  • Thanks @Asmus, the issue is indeed the nesting of ndarrays in pandas/dask; when using the HDF5 backend in pandas with nested arrays, an explicit warning is given about this. See [here](https://notebooks.azure.com/styagev/projects/scratch/html/nested_nparray_pandas.ipynb) – stav Apr 21 '19 at 08:20
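
For reference, a minimal sketch of the HDF5/h5py approach suggested in the comments above, using a variable-length dtype so that entries of different lengths can live in one dataset (file name, shapes and metadata are made up; `h5py.vlen_dtype` assumes a reasonably recent h5py):

import h5py
import numpy as np

with h5py.File("speech_demo.h5", "w") as f:
    vlen_f32 = h5py.vlen_dtype(np.float32)                # ragged float32 entries
    wav = f.create_dataset("wav", shape=(3,), dtype=vlen_f32)
    wav[0] = np.random.randn(16000).astype(np.float32)
    wav[1] = np.random.randn(24000).astype(np.float32)    # different length is fine
    wav[2] = np.random.randn(8000).astype(np.float32)
    f.create_dataset("speaker_id", data=[b"spk0", b"spk1", b"spk0"])
    wav.attrs["sample_rate"] = 16000                       # metadata as attributes

with h5py.File("speech_demo.h5", "r") as f:
    clip = f["wav"][1]   # comes back as a 1-D float32 array of length 24000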

2 Answers


One (perhaps ugly) way is to patch the pandas and dask parquet APIs to support multi-dimensional arrays:

# these monkey-patches into the pandas and dask I/O API allow us to save multi-dimensional numpy
# arrays in parquet format by serializing them into byte arrays

from dask import dataframe as dd
import numpy as np
import pandas as pd
from io import BytesIO

def _patched_pd_read_parquet(*args, **kwargs):
    return _orig_pd_read_parquet(*args, **kwargs).applymap(
        lambda val: np.load(BytesIO(val)) if isinstance(val, bytes) else val)
_orig_pd_read_parquet = pd.io.parquet.PyArrowImpl.read
pd.io.parquet.PyArrowImpl.read = _patched_pd_read_parquet

def _serialize_ndarray(arr: np.ndarray) -> bytes:
    if isinstance(arr, np.ndarray):
        with BytesIO() as buf:
            np.save(buf, arr)
            return buf.getvalue()
    return arr

def _deserialize_ndarray(val: bytes) -> np.ndarray:
    return np.load(BytesIO(val)) if isinstance(val, bytes) else val

def _patched_pd_write_parquet(self, df: pd.DataFrame, *args, **kwargs):
    return _orig_pd_write_parquet(self, df.applymap(_serialize_ndarray), *args, **kwargs)
_orig_pd_write_parquet = pd.io.parquet.PyArrowImpl.write
pd.io.parquet.PyArrowImpl.write = _patched_pd_write_parquet

def _patched_dask_read_pyarrow_parquet_piece(*args, **kwargs):
    return _orig_dask_read_pyarrow_parquet_piece(*args, **kwargs).applymap(_deserialize_ndarray)
_orig_dask_read_pyarrow_parquet_piece = dd.io.parquet._read_pyarrow_parquet_piece
dd.io.parquet._read_pyarrow_parquet_piece = _patched_dask_read_pyarrow_parquet_piece

def _patched_dd_write_partition_pyarrow(df: pd.DataFrame, *args, **kwargs):
    return _orig_dd_write_partition_pyarrow(df.applymap(_serialize_ndarray), *args, **kwargs)
_orig_dd_write_partition_pyarrow = dd.io.parquet._write_partition_pyarrow
dd.io.parquet._write_partition_pyarrow = _patched_dd_write_partition_pyarrow

You can then use the tricks specified in the question to get nested arrays in pandas cells (in memory), and the above will act as a "poor man's" codec, serializing the arrays into byte streams that serialization formats such as parquet can handle.
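
For example, assuming the patches above have been applied first, a round trip could look like this (file name and column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"speaker_id": ["spk0", "spk1"]})
df["wav"] = pd.Series(
    [np.random.randn(16000).astype("float32"),
     np.random.randn(8000).astype("float32")],
    index=df.index,
    dtype=object,
)

df.to_parquet("clips.parquet", engine="pyarrow")           # arrays go in as bytes
restored = pd.read_parquet("clips.parquet", engine="pyarrow")
print(type(restored["wav"].iloc[0]))                       # <class 'numpy.ndarray'>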

stav

The data organisation that you have does indeed sound an awful lot like an xarray: multi-dimensional data, with regular coordinates along each of the dimensions and variable properties. xarray allows you to operate on your array in a pandas-like fashion (the docs are very detailed, so I won't go into it). Of note, xarray interfaces directly with Dask so that, as you operate on the high-level data structure, you are actually manipulating dask arrays underneath and so can compute out-of-core and/or distributed.

Although xarray was inspired by the netCDF hierarchical data representation (typically stored as HDF5 files), there are a number of possible storage options you could use, including zarr, which is particularly useful as a cloud format for the kind of parallel access Dask likes to use.
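
A minimal sketch of what this could look like, assuming clips padded/trimmed to a fixed length so they fit a regular array (names, shapes and the zarr path are made up for illustration):

import numpy as np
import xarray as xr

# a batch of fixed-length clips: dims are (utterance, sample)
ds = xr.Dataset(
    {"wav": (("utterance", "sample"),
             np.random.randn(100, 16000).astype("float32"))},
    coords={
        "speaker_id": ("utterance", [f"spk{i % 10}" for i in range(100)]),
        "is_test": ("utterance", np.arange(100) % 5 == 0),
    },
)

ds = ds.chunk({"utterance": 25})          # back the variables with dask arrays
ds.to_zarr("speech_demo.zarr", mode="w")  # parallel-friendly on-disk format
roundtrip = xr.open_zarr("speech_demo.zarr")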

mdurant
  • Thanks @mdurant, although I fail to see how xarray answers the requirements. The datasets I described in the question are not multidimensional in their "labels" (i.e. entries do not share values across different dimensions). Instead, the "values" themselves are multidimensional (e.g. a variable-sized image or sound wave). If I'm wrong, it'd be great if you could share a small snippet of how to apply xarray to the examples in the question – stav Apr 17 '19 at 19:06