I have a large pandas DataFrame whose individual elements are complex numpy arrays. Below is a minimal code example that reproduces the scenario:


import numpy as np
import pandas as pd

d = {f'x{i}': [] for i in range(4)}
df = pd.DataFrame(data=d).astype(object)

for K in range(4):
    for i in range(4):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=(2, 2)) + np.random.random(size=(2, 2)) * 1j

df

What is the best way to save these and load them up again for use later?

The problem is that when I increase the size of the stored matrices and the number of elements, I get an OverflowError when I try to save the DataFrame as an .h5 file, as shown below:

import numpy as np
import pandas as pd

size = (300, 300)
xs = 1500

d = {f'x{i}': [] for i in range(xs)}
df = pd.DataFrame(data=d).astype(object)

for K in range(10):
    for i in range(xs):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j

df.to_hdf('test.h5', key="df", mode="w")

load_test = pd.read_hdf("test.h5", "df")
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-124-8cb8df1a0653> in <module>
     12         df.loc[f'{K}', f'x{i}'] = np.random.random(size=size) + np.random.random(size=size) * 1j
     13 
---> 14 df.to_hdf('test.h5', key="df", mode="w")
     15 
     16 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
   2447             data_columns=data_columns,
   2448             errors=errors,
-> 2449             encoding=encoding,
   2450         )
   2451 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
    268             path_or_buf, mode=mode, complevel=complevel, complib=complib
    269         ) as store:
--> 270             f(store)
    271     else:
    272         f(path_or_buf)

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in <lambda>(store)
    260             data_columns=data_columns,
    261             errors=errors,
--> 262             encoding=encoding,
    263         )
    264 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times)
   1127             encoding=encoding,
   1128             errors=errors,
-> 1129             track_times=track_times,
   1130         )
   1131 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
   1799             nan_rep=nan_rep,
   1800             data_columns=data_columns,
-> 1801             track_times=track_times,
   1802         )
   1803 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write(self, obj, **kwargs)
   3189             # I have no idea why, but writing values before items fixed #2299
   3190             blk_items = data.items.take(blk.mgr_locs)
-> 3191             self.write_array(f"block{i}_values", blk.values, items=blk_items)
   3192             self.write_index(f"block{i}_items", blk_items)
   3193 

~/PQKs/pqks/lib/python3.6/site-packages/pandas/io/pytables.py in write_array(self, key, value, items)
   3047 
   3048             vlarr = self._handle.create_vlarray(self.group, key, _tables().ObjectAtom())
-> 3049             vlarr.append(value)
   3050 
   3051         elif empty_array:

~/PQKs/pqks/lib/python3.6/site-packages/tables/vlarray.py in append(self, sequence)
    526             nparr = None
    527 
--> 528         self._append(nparr, nobjects)
    529         self.nrows += 1
    530 

~/PQKs/pqks/lib/python3.6/site-packages/tables/hdf5extension.pyx in tables.hdf5extension.VLArray._append()

OverflowError: value too large to convert to int

Zohim

1 Answer


As noted in the similar issue https://stackoverflow.com/a/57133759/8896855, hdf/h5 files carry extra overhead and are designed to hold many dataframes in a single file-system-like container, whereas feather and parquet objects will likely be a better fit for saving and reloading a single large dataframe as an in-memory object. As for the specific OverflowError: it most likely comes from storing large mixed-type values (here, numpy arrays) in object-dtype columns, which pandas has to pickle into a single variable-length PyTables row; once the block is large enough, its byte size no longer fits the integer type PyTables uses internally. One (more involved) option would be to split the arrays in your dataframe out into separate scalar columns, as sketched below, but that's probably unnecessary.
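For completeness, here is a minimal sketch of that split-out idea, assuming every cell holds a complex numpy array of one common shape; split_complex_columns is a hypothetical helper, not a pandas API. The real and imaginary parts are separated because parquet/feather cannot store complex dtypes directly:

import numpy as np
import pandas as pd

def split_complex_columns(df):
    # Hypothetical helper: assumes every cell in every column is a
    # complex numpy array of the same shape.
    out = {}
    for col in df.columns:
        stacked = np.stack(list(df[col]))    # e.g. (n_rows, 2, 2) complex
        flat = stacked.reshape(len(df), -1)  # one scalar per new column
        for j in range(flat.shape[1]):
            out[f'{col}_re{j}'] = flat[:, j].real
            out[f'{col}_im{j}'] = flat[:, j].imag
    return pd.DataFrame(out, index=df.index)  # all-float frame, parquet-friendly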

A quick general fix would be to use df.to_pickle(r'path_to/filename.pkl'), but to_feather or to_parquet likely offer more optimized solutions.
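A minimal round trip with pickle, reusing the small example from the question (test.pkl is an arbitrary filename):

import numpy as np
import pandas as pd

d = {f'x{i}': [] for i in range(4)}
df = pd.DataFrame(data=d).astype(object)
for K in range(4):
    for i in range(4):
        df.loc[f'{K}', f'x{i}'] = np.random.random(size=(2, 2)) + np.random.random(size=(2, 2)) * 1j

df.to_pickle('test.pkl')             # writes the whole object DataFrame
loaded = pd.read_pickle('test.pkl')  # restores it, complex arrays and all

# Each cell round-trips as a complex numpy array
assert np.allclose(df.loc['0', 'x0'], loaded.loc['0', 'x0'])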

AlecZ