I have a Pandas DataFrame with the following structure, which contains both numbers and numpy arrays of fixed shape:

import pandas as pd
import numpy as np

df = pd.DataFrame({"num": (23, 42), "list": (np.arange(3), np.arange(1, 4))})

Assuming I have a large amount (more than 1 GB) of this data that I would like to store and retrieve quickly, how should I go about storing it? If I use HDF5, the NumPy arrays get pickled, which slows down retrieval. Is there some way to tell HDF5 how to store NumPy arrays? Alternatively, should I not be using HDF5 at all?
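
For what it's worth, a quick check (a minimal sketch using the same toy data) shows that the array column is held as generic Python objects rather than as a numeric dtype, which I assume is why it falls back to pickling:

```python
import numpy as np
import pandas as pd

# Same toy frame as above: one integer column, one column of fixed-shape arrays.
df = pd.DataFrame({"num": [23, 42], "list": [np.arange(3), np.arange(1, 4)]})

# "num" gets an integer dtype; "list" is stored with dtype object,
# i.e. each cell is a reference to a separate NumPy array.
print(df.dtypes)
list_is_object = df["list"].dtype == object
```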

This GitHub thread seems to suggest two options:

  1. Create a function that gets the desired NumPy array, which is stored in some other format [1]
  2. Create a class to inform HDF5 [2]

Both of these solutions seem oddly specific for how common I imagine this problem to be. Are there more general approaches? Am I just using the wrong tool?

Seanny123

1 Answer

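One option is to expand the array column into one numeric column per element, so that every column holds plain numbers rather than array objects. I mean something like this: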

df_x = pd.concat([df.num, pd.DataFrame(np.vstack(df.list))], 
                 keys=["key", "arr"], axis=1)

The resulting DataFrame:

  key arr      
  num   0  1  2
0  23   0  1  2
1  42   1  2  3

Convert back with:

pd.concat([df_x.key, pd.Series(tuple(df_x.arr.values), name='list')], axis=1)

   num       list
0   23  [0, 1, 2]
1   42  [1, 2, 3]
piRSquared
HYRY