
I am trying to write (append) a pandas DataFrame to an HDF5 file. I use the h5py library. I drop duplicates first to reduce the size.

TestFrame = TestFrame.drop_duplicates()
print(TestFrame.shape)
print(TestFrame.info())
TestFrame.to_hdf("data.h5", key="dataset_01", mode="a")
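A possible workaround (a sketch, not verified against the original data): the traceback ends in `pickle.dumps`, because the default `fixed` HDF format serializes the entire block of `object` (string) columns into a single in-memory buffer. Writing with `format="table"` instead stores strings as fixed-width columns and supports appending in chunks, so peak memory is bounded by the chunk size. The frame below is a small hypothetical stand-in for `TestFrame`; `min_itemsize` and the chunk size are illustrative values to adjust for real data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for TestFrame: five object (string) columns,
# mirroring the dtypes shown by TestFrame.info().
df = pd.DataFrame(
    {str(c): [f"row{r}_col{c}" for r in range(1000)] for c in range(5)}
).drop_duplicates()

path = os.path.join(tempfile.mkdtemp(), "data.h5")

# format="table" avoids pickling the whole object block in one buffer,
# and store.append lets us write the frame in pieces.
chunk = 250
with pd.HDFStore(path, mode="w") as store:
    for start in range(0, len(df), chunk):
        store.append(
            "dataset_01",
            df.iloc[start : start + chunk],
            format="table",
            min_itemsize=50,  # max string width; must cover the longest value
        )

back = pd.read_hdf(path, "dataset_01")
```

Note that `format="table"` requires string widths to be known (`min_itemsize`), and a value longer than that width will raise an error, so it trades flexibility for bounded memory use.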

The TestFrame.info() gives the following information:

(202496, 21) #shape of the frame
<class 'pandas.core.frame.DataFrame'>
Int64Index: 202496 entries, 0 to 367949
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       202496 non-null  object
 1   1       202496 non-null  object
 2   2       202496 non-null  object
 3   3       202496 non-null  object
 4   4       202496 non-null  object
dtypes: object(5)
memory usage: 17.8+ MB
None

I get the following error:

 File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 2490, in to_hdf
    pytables.to_hdf(
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\pytables.py", line 282, in to_hdf
    f(store)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\pytables.py", line 265, in <lambda>
    f = lambda store: store.put(
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\pytables.py", line 1030, in put
    self._write_to_group(
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\pytables.py", line 1697, in _write_to_group
    s.write(
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\pytables.py", line 3101, in write
    self.write_array(f"block{i}_values", blk.values, items=blk_items)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\pytables.py", line 2958, in write_array
    vlarr.append(value)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\tables\vlarray.py", line 525, in append
    sequence = atom.toarray(sequence)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\tables\atom.py", line 1083, in toarray
    buffer_ = self._tobuffer(object_)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python38-32\lib\site-packages\tables\atom.py", line 1216, in _tobuffer
    return pickle.dumps(object_, pickle.HIGHEST_PROTOCOL)
MemoryError

I tried using to_csv and it works without any error, but I want to use the HDF5 file format.

  • Does this solution help? https://stackoverflow.com/questions/28068872/memoryerror-with-pickle-in-python – Jonno_FTW Jun 04 '21 at 04:29
  • @Jonno_FTW Yeah, I tried to load the DataFrame into a `numpy` array, but it also had a limited dimension size. It gave me an error saying that it exceeds the limit of how much numpy can store. That is why I switched to a DataFrame, which can handle large data. – Searching Python Jun 04 '21 at 17:04

0 Answers