Because I got a memory error while concatenating pandas DataFrames, I decided to write each DataFrame to a binary file in append mode and then read that file back to rebuild the whole DataFrame.
However, I got 'ValueError: cannot create an OBJECT array from memory buffer'.
If all DataFrames have only numeric columns, the problem does not occur. However, if any column contains strings (in my case, the DataFrames have many string columns), this ValueError is raised. The code below reproduces the situation. Uncomment the lines under #works1 or #works2 to see that there is no error; using the DataFrame under #does not work raises the ValueError.
import os
import pandas as pd
import numpy as np

mtot = 0  # total number of rows written so far
if os.path.exists('df_all.bin'):
    os.remove('df_all.bin')

for i in range(2):
    #works1
    # df = pd.DataFrame(np.random.randint(100, size=(5, 2)))
    #works2
    # df = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3], 'C':[1.0,2.0,3.0]})
    # df = df.astype(dtype={'A': int, 'B': int, 'C': float})
    #does not work
    df = pd.DataFrame({'A':[1,2,3], 'B':['sample1','sample2','sample3'], 'C':[1.0,2.0,3.0]})
    df = df.astype(dtype={'A': int, 'B': str, 'C': float})
    typ = df.values.dtype
    print('dtype:%s' % typ)
    # append the raw bytes of this chunk to the binary file
    with open('df_all.bin', 'ab') as f:
        m, n = df.shape
        mtot += m
        f.write(df.values.tobytes())

# read everything back and rebuild a single DataFrame
with open('df_all.bin', 'rb') as f:
    buffer = f.read()
    nparray = np.frombuffer(buffer, dtype=typ)  # the ValueError is raised here
    data = nparray.reshape(mtot, n)
    whole_df = pd.DataFrame(data=data, columns=list(range(n)))
print(whole_df)
print(whole_df.shape)
os.remove('df_all.bin')
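To narrow this down, here is a minimal sketch that leaves out the file round trip. It assumes the cause is that a DataFrame mixing int, str and float columns exposes `.values` as a single `object` array, so `tobytes()` would be writing object pointers rather than the string data, and `np.frombuffer` refuses to rebuild an object array from raw bytes. The variable names `values`, `raw` and `error_message` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['sample1', 'sample2', 'sample3'],
                   'C': [1.0, 2.0, 3.0]})

# Mixed column types are upcast to one common dtype for the 2-D array.
values = df.values
print(values.dtype)

# For an object array these bytes are in-memory pointers, not the data itself.
raw = values.tobytes()

try:
    np.frombuffer(raw, dtype=values.dtype)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the numeric-only cases work simply because their `.values` dtype is a plain fixed-size type such as `int64` or `float64`.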
How can I get rid of this ValueError?
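One direction that seems to avoid the error, sketched here under assumptions and not tested on my real data: write each chunk as a NumPy structured array with a fixed record layout instead of an object array. The layout `record_dtype`, the 16-byte width chosen for column B, and the temporary file path are assumptions made for this example; strings longer than the chosen width would be silently truncated.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Assumed fixed layout: 64-bit int, 16-byte string, 64-bit float per row.
record_dtype = np.dtype([('A', 'i8'), ('B', 'S16'), ('C', 'f8')])

path = os.path.join(tempfile.mkdtemp(), 'df_all.bin')

for i in range(2):
    df = pd.DataFrame({'A': [1, 2, 3],
                       'B': ['sample1', 'sample2', 'sample3'],
                       'C': [1.0, 2.0, 3.0]})
    records = np.empty(len(df), dtype=record_dtype)
    records['A'] = df['A'].to_numpy()
    # Encode the strings to bytes so they fit the fixed-width 'S16' field.
    records['B'] = df['B'].str.encode('utf-8').to_numpy()
    records['C'] = df['C'].to_numpy()
    with open(path, 'ab') as f:
        f.write(records.tobytes())

with open(path, 'rb') as f:
    # copy() gives a writable array; frombuffer alone returns a read-only view.
    restored = np.frombuffer(f.read(), dtype=record_dtype).copy()

whole_df = pd.DataFrame(restored)
# Turn the fixed-width bytes back into Python strings.
whole_df['B'] = whole_df['B'].str.decode('utf-8')
print(whole_df)
print(whole_df.shape)

os.remove(path)
```

Because every record has the same byte size, the row count does not need to be tracked separately; it follows from the file size divided by `record_dtype.itemsize`.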
Thanks