Best way to cache a pandas dataframe?

Question

Yesterday I learned the hard way that saving a pandas dataframe to CSV for later use is a bad idea. I have a dataframe of roughly 130k tweets, where one row of the dataframe is a list of tweets. When I saved the data to CSV and then loaded the dataframe back in, the rows of my dataframe were of type string. This led to all kinds of errors and a lot of debugging. Of course it was a stupid mistake to assume that CSV would be able to preserve information about which data structure type my data is.

My question now is: How do I save a dataframe for later use, in a way that information about which data types my columns/rows are is preserved?

Try [`DataFrame.to_pickle`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html) — Chris Adams, Nov 25 '19 at 09:28

score 3 · Accepted Answer · answered Oct 16 '20 at 05:05

I hope you found the solution you were looking for.
To answer the question: you can use the DataFrame.to_pickle() method to serialize the dataframe (that is, convert Python objects into a byte stream). When you deserialize the pickle file with pd.read_pickle(), you get the data back as it was, including Python objects such as lists. Keep in mind that pickle files can pose a security threat when they come from untrusted sources, because loading one can execute arbitrary code.

Here's an example from the pandas documentation on how to use pickle:

>>> import pandas as pd
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9

>>> pd.to_pickle(original_df, "./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl")
>>> unpickled_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
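
To connect this back to the list-of-tweets case in the question, here is a minimal sketch comparing a CSV round trip with a pickle round trip. The column names, the sample data, and the temporary file names are made up for illustration; only to_csv, read_csv, to_pickle, and read_pickle are pandas calls.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: each cell in "tweets" holds a Python list of strings.
df = pd.DataFrame({"user": ["a", "b"], "tweets": [["hi", "there"], ["one"]]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "tweets.csv")
    pkl_path = os.path.join(tmp, "tweets.pkl")

    # CSV round trip: every cell is written out as text.
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # Pickle round trip: the Python objects themselves are serialized.
    df.to_pickle(pkl_path)
    from_pkl = pd.read_pickle(pkl_path)

csv_cell_type = type(from_csv.loc[0, "tweets"])
pkl_cell_type = type(from_pkl.loc[0, "tweets"])

print(csv_cell_type)  # the list came back as its string representation
print(pkl_cell_type)  # the list came back as a list
```

After the CSV round trip the cell is the text "['hi', 'there']", whereas the pickle round trip returns the original list, so code that iterates over the tweets keeps working.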
