A common use case in machine/deep learning code that works on images and audio is loading and manipulating large datasets of images or audio segments. Almost always, the entries in these datasets pair an image or audio segment with metadata (e.g. class label, whether it is a training or test instance, etc.).
For instance, in my specific use case of speech recognition, datasets are almost always composed of entries with properties such as these (a concrete sketch of one entry follows the list):
- Speaker ID (string)
- Transcript (string)
- Test data (bool)
- Wav data (numpy array)
- Dataset name (string)
- ...
What is the recommended way to represent such a dataset in pandas and/or dask, with emphasis on the wav data (in an image dataset, this would be the image data itself)?
In pandas, with a few tricks, one can nest a numpy array inside a column (sketched below), but this doesn't serialize well and also won't work with dask. This seems to be an extremely common use case, yet I can't find any relevant recommendations.
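
The object-dtype trick I have in mind looks roughly like this (purely illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "speaker_id": ["spk-0001", "spk-0002"],
    "transcript": ["hello", "world"],
    "is_test": [False, True],
})
# Each cell of the "wav" column holds an entire (possibly ragged) array.
df["wav"] = pd.Series(
    [np.zeros(16000, dtype=np.float32), np.zeros(8000, dtype=np.float32)],
    dtype=object,
)

print(df.dtypes)  # "wav" is plain dtype=object, which is the problem:
# columnar formats don't handle it cleanly, and dask partitioning and
# serialization of object columns is unreliable.
```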
One can also serialize/deserialize these arrays to a binary format (Uber's petastorm does something like this), but this seems to miss the point of libraries such as dask and pandas, where automagic serialization is one of the core benefits.
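
By that I mean something along these lines (the helper names are mine, and I'm assuming fixed-dtype 1-D audio; petastorm's actual mechanism differs):

```python
import numpy as np
import pandas as pd

def encode_wav(arr: np.ndarray) -> bytes:
    # Raw samples as bytes; dtype and shape must be fixed by convention
    # or stored in separate columns.
    return arr.astype(np.float32).tobytes()

def decode_wav(buf: bytes) -> np.ndarray:
    return np.frombuffer(buf, dtype=np.float32)

df = pd.DataFrame({"speaker_id": ["spk-0001"]})
df["wav_bytes"] = [encode_wav(np.zeros(16000, dtype=np.float32))]

# bytes columns round-trip through columnar formats such as Parquet, so
# this plays nicely with dask partitions, but every read/write needs the
# manual encode/decode step shown here.
wav = decode_wav(df["wav_bytes"].iloc[0])
```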
Any practical comments or suggestions for different methodologies are most welcome.