42

I am learning Python pandas. I saw a tutorial that shows two ways to save a pandas dataframe:

  1. df.to_csv('sub.csv') to save, and pd.read_csv('sub.csv') to load it back

  2. df.to_pickle('sub.pkl') to save, and pd.read_pickle('sub.pkl') to load it back

The tutorial says to_pickle saves the dataframe to disk. I am confused by this, because when I use to_csv, a csv file appears in the folder, which I assume also means it was saved to disk, right?

In general, why would we save a dataframe with to_pickle rather than as csv, txt, or another format?

KevinKim
  • Matthew Rocklin does an interesting speed analysis [here](http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) – dumbledad Jun 18 '19 at 10:12

2 Answers

54

csv

  • ✅human readable
  • ✅cross platform
  • ⛔slower
  • ⛔more disk space
  • ⛔doesn't preserve types in some cases

pickle

  • ✅fast saving/loading
  • ✅less disk space
  • ⛔non human readable
  • ⛔python only

Also take a look at the parquet format (to_parquet, read_parquet):

  • ✅fast saving/loading
  • ✅less disk space than pickle
  • ✅supported by many platforms
  • ⛔non human readable
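
To check the speed and disk-space points above on your own data, here is a minimal sketch that times a write-and-read round trip for CSV and pickle and reports the file size. The helper name `compare_formats` and the random test dataframe are my own assumptions for illustration, not part of pandas or the tutorial. Parquet is left out only because `to_parquet` needs an extra engine such as pyarrow installed.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def compare_formats(df, directory):
    """Write df as CSV and as pickle, read each back, and return
    the round-trip time in seconds and the file size in bytes."""
    # Hypothetical helper: each entry pairs a writer with its reader.
    formats = {
        "csv": (lambda path: df.to_csv(path, index=False), pd.read_csv),
        "pickle": (df.to_pickle, pd.read_pickle),
    }
    results = {}
    for name, (write, read) in formats.items():
        path = os.path.join(directory, f"sub.{name}")
        start = time.perf_counter()
        write(path)
        read(path)
        elapsed = time.perf_counter() - start
        results[name] = {"seconds": elapsed, "bytes": os.path.getsize(path)}
    return results


# Made-up numeric data; replace with your own dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 5)), columns=list("abcde"))

with tempfile.TemporaryDirectory() as tmp:
    results = compare_formats(df, tmp)

for name, stats in results.items():
    print(f"{name}: {stats['seconds']:.3f} s, {stats['bytes'] / 1e6:.2f} MB")
```

The numbers depend heavily on the data (column types, number of digits, repeated values), so treat the list above as a rule of thumb and measure on a representative sample before choosing a format.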
Jean-François Fabre
artoby
  • 1
    Also take a look at _feather_ format (`to_feather`, `read_feather`) According to a [TDS review](https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d) it "shows high I/O speed, doesn’t take too much memory on the disk and doesn’t need any unpacking when loaded back into RAM." – mirekphd Jun 28 '20 at 11:33
  • 3
    Thanks this answer is very concise and to the point. For a detailed breakdown, I found [this post](https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d) that does an indepth breakdown, including `to_feather` vs `to_parquet` – Hamman Samuel Jun 09 '21 at 16:05
  • 3
    One thing I would add to the comparison is the **[pickle incompatibility risk between different Python/pandas versions](https://stackoverflow.com/questions/37371451/importerror-no-module-named-pandas-indexes)** (CSV data will always remain readable). For example, my workstation at the office is old and uses Python 3.4: its highest pandas version cannot handle pickled pandas dataframes generated by my Python 3.8 at home. So CSV is a better choice when you cannot control all the pandas versions that will be using your files. In summary: **CSV is** not only "cross platform" but **also "cross versions"** – abu Nov 25 '22 at 11:03
  • Another [useful article about pickle](https://towardsdatascience.com/stop-using-csvs-for-storage-pickle-is-an-80-times-faster-alternative-832041bbc199) – abu Nov 25 '22 at 11:20
  • Well, I loved the clarity of this beautiful answer, but I concur with Abu about the version issues with Pandas. It has tripped me up. Also, well I started to use the pickle approach for speed and space.... but strangely I do NOT find my pickles smaller. Just experimentally saved a strip of annotated ECG: 272.1MB (pickle), but 155.8MB (csv). That said, writing and reading the pickle took only 0.48s and the csv took a very irritating 11.49s !! – RichardBJ Feb 25 '23 at 17:50
34

Pickle is a serialized way of storing a Pandas dataframe. Basically, you are writing the exact representation of the dataframe to disk. This means the column types and the indices are preserved when you load it back. If you simply save a file as csv, you are just storing it as comma-separated text. Depending on your data set, some information will be lost when you load it back up.
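
As a small illustration of what "some information will be lost" can look like, here is a sketch with a made-up three-column dataframe (the column names and values are assumptions chosen for the demo). It saves the same dataframe both ways and compares what comes back.

```python
import os
import tempfile

import pandas as pd

# Made-up example data: a datetime column, a categorical column,
# and string codes with leading zeros.
df = pd.DataFrame({
    "when": pd.to_datetime(["2018-02-13", "2018-02-14"]),
    "grade": pd.Categorical(["low", "high"]),
    "code": ["007", "042"],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sub.csv")
    pkl_path = os.path.join(tmp, "sub.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

print("original dtypes:\n", df.dtypes, sep="")
print("after CSV round trip:\n", from_csv.dtypes, sep="")
print("after pickle round trip:\n", from_pickle.dtypes, sep="")

# read_csv infers integers for "code", so the leading zeros are gone.
print(from_csv["code"].tolist())
print(from_pickle["code"].tolist())
```

With CSV you can recover most of this by passing options such as `dtype=` and `parse_dates=` to `read_csv`, but you have to remember to do so each time, whereas the pickle carries the types with it.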

You can read more about the pickle library in Python here.

Mostafa Ghadimi
Gabriel A
  • 1
    So you mean to_pickle is preferable when saving a pandas dataframe, i.e., it preserves the original dataframe? Are there other advantages of to_pickle, for example in terms of loading speed? – KevinKim Feb 13 '18 at 15:54
  • 3
    @KevinKim, you may want to check [this comparison](https://stackoverflow.com/questions/37010212/what-is-the-fastest-way-to-upload-a-big-csv-file-in-notebook-to-work-with-python) – MaxU - stand with Ukraine Feb 13 '18 at 15:56
  • 2
    The main advantage of saving in CSV would be having a standardized format that can be opened with a wide range of software/languages – Alessandro Feb 13 '18 at 16:00
  • @MaxU Thanks! So if my original data set is a large csv file, I guess it would be good to first load it into pandas and then store it using to_pickle. Then, the next time I need this dataframe, I can use read_pickle to load it much faster, is that correct? – KevinKim Feb 13 '18 at 16:01
  • @Alessandro yes, that makes sense, I agree with you – KevinKim Feb 13 '18 at 16:02