I have a pandas DataFrame that contains both plain numbers and NumPy arrays of fixed shape:
import pandas as pd
import numpy as np
df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})
I have a large amount of this data (more than 1 GB) that I would like to store and retrieve quickly. How should I go about storing it? If I use HDF5, the NumPy arrays get pickled, which slows down retrieval. Is there some way to tell HDF5 how to store NumPy arrays? Alternatively, should I not be using HDF5 at all?
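To show where I think the pickling comes from: as far as I can tell, the column of arrays ends up with the generic `object` dtype, while the numeric column gets a native integer dtype. The snippet below is only a small check of that, not a fix. It also shows that, because all arrays share one shape, they can be stacked into a single 2-D array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})

# The scalar column gets a native integer dtype.
num_dtype = df["num"].dtype

# The column holding arrays is a generic object column, one Python
# object (an ndarray) per row.
list_dtype = df["list"].dtype

# Every array has the same shape, so the column can be stacked into
# one contiguous 2-D array of shape (n_rows, 3).
stacked = np.stack(df["list"].tolist())

print(num_dtype, list_dtype, stacked.shape)
```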
A GitHub thread on this topic seems to suggest two options:
- Write a function that retrieves the desired NumPy array, which is stored in some other format [1]
- Create a class that tells HDF5 how to store the arrays [2]
Both of these solutions seem oddly specific for how common I imagine this problem to be. Are there more general approaches? Am I just using the wrong tool?
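For concreteness, here is a minimal sketch of how I read the first option. Everything in it is my own assumption rather than something taken from the thread: the helper names `save_split` and `load_split`, the file names, and the choice of `.npy` for the arrays and CSV for the scalar columns are all hypothetical. Any tabular format could stand in for the CSV part.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_split(df, directory, array_col="list"):
    """Hypothetical helper: store the array column and the scalar columns separately."""
    # Stack the fixed-shape arrays into one 2-D array and save it natively.
    arrays = np.stack(df[array_col].tolist())
    np.save(os.path.join(directory, "arrays.npy"), arrays)
    # Store the remaining scalar columns in a plain table (CSV is a placeholder).
    scalars = df.drop(columns=[array_col])
    scalars.to_csv(os.path.join(directory, "scalars.csv"), index=False)


def load_split(directory, array_col="list"):
    """Hypothetical helper: rebuild the original DataFrame from the two files."""
    scalars = pd.read_csv(os.path.join(directory, "scalars.csv"))
    arrays = np.load(os.path.join(directory, "arrays.npy"))
    # Re-attach one 1-D array per row.
    scalars[array_col] = list(arrays)
    return scalars


df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})

with tempfile.TemporaryDirectory() as tmp:
    save_split(df, tmp)
    restored = load_split(tmp)

print(restored)
```

This round-trips the small example, but it means keeping two files in sync by row order, which is part of why it feels oddly specific to me.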