Yesterday I learned the hard way that saving a pandas dataframe to CSV for later use is a bad idea. I have a dataframe of roughly 130k tweets, in which each row holds a list of tweets. When I saved the data to CSV and then loaded the dataframe back in, the rows that had held lists were now strings. This led to all kinds of errors and a lot of debugging. Of course, it was a mistake to assume that CSV could preserve information about the data structure type of my data.

My question now is: how do I save a dataframe for later use in a way that preserves the data types of my columns and rows?

Psychotechnopath

1 Answer

I hope you found the solution you were looking for.
To answer the question: you can use the DataFrame.to_pickle() method to serialize the dataframe (that is, convert the Python objects into a byte stream). When you deserialize the pickle file with pd.read_pickle(), you get the data back as it was, including the Python types of the values in each cell. Keep in mind that pickle files can pose a security threat when they come from untrusted sources, because unpickling can execute arbitrary code.

Here's an example from the pandas documentation on how to use pickle:

>>> import pandas as pd
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9

>>> pd.to_pickle(original_df, "./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl")
>>> unpickled_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
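
For the situation in the question (cells that hold Python lists), here is a minimal sketch comparing the two round trips. The dataframe, its column names ("user", "tweets"), and the file names are made up for illustration; they are not taken from the asker's data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the tweets dataframe: each "tweets" cell is a list.
df = pd.DataFrame(
    {"user": ["a", "b"], "tweets": [["hi", "there"], ["one tweet"]]}
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "tweets.csv")
    pkl_path = os.path.join(tmp, "tweets.pkl")

    # CSV round trip: the list is written out as its text representation.
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # Pickle round trip: the Python objects themselves are serialized.
    df.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

csv_cell_type = type(from_csv.loc[0, "tweets"])        # str
pickle_cell_type = type(from_pickle.loc[0, "tweets"])  # list
print(csv_cell_type, pickle_cell_type)
```

After the CSV round trip the cell is the string "['hi', 'there']", while the pickle round trip gives back the original list.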
Singh