I am trying to store some extra information, such as parameters describing the data, directly in the DataFrame that holds that data.
I added this information just as extra attributes to the DataFrame:
df.data_origin = 'my_origin'
print(df.data_origin)
But when the DataFrame is pickled and loaded again, those extra attributes are lost:
df.to_pickle('pickle_test.pkl')
df2 = pd.read_pickle('pickle_test.pkl')
print(len(df2))
print(df2.data_origin)
...
465387
>>> AttributeError: 'DataFrame' object has no attribute 'data_origin'
The workaround I have found is to pickle the __dict__ of the DataFrame and then assign it to the __dict__ of an empty DataFrame:
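import pickle
# Save the instance __dict__, which holds both the data and the extra attributes
with open('modified_dataframe.pkl', "wb") as pkl_out:
    pickle.dump(df.__dict__, pkl_out)
# Restore it into an empty DataFrame
df2 = pd.DataFrame()
with open('modified_dataframe.pkl', "rb") as pkl_in:
    df2.__dict__ = pickle.load(pkl_in)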
print(len(df2))
print(df2.data_origin)
...
465387
my_origin
It seems to work, but:
- Is there a better way to do it?
- Am I losing information? (apparently, all the data is there)
- A different solution is discussed here, but I would like to know whether saving the __dict__ of an object is a valid way to preserve all of its information.
EDIT: OK, I found a big drawback. This works fine for saving single DataFrames in separate files, but it does not work when the DataFrames sit inside dictionaries, lists or similar containers.
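For the container case, one alternative I can think of is to stop attaching attributes to the DataFrame and instead pickle each DataFrame together with a plain dict of metadata. This is only a minimal sketch: the helper name bundle and the "data"/"meta" keys are my own invention, not a pandas API, and the small example DataFrame stands in for the real one.

```python
import pickle

import pandas as pd


def bundle(df, **meta):
    """Pair a DataFrame with a plain dict of metadata (hypothetical helper)."""
    return {"data": df, "meta": dict(meta)}


# Small stand-in for the real DataFrame
df = pd.DataFrame({"a": [1, 2, 3]})

# The bundles are ordinary dicts, so they can sit inside other containers
frames = {"first": bundle(df, data_origin="my_origin")}

# Round-trip through bytes; pickle.dump / pickle.load on a file work the same way
restored = pickle.loads(pickle.dumps(frames))

print(len(restored["first"]["data"]))
print(restored["first"]["meta"]["data_origin"])
```

The metadata no longer travels inside the DataFrame itself, which is the price of this approach, but nothing depends on DataFrame internals and nested containers pickle as usual.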