4

I want to save a pandas DataFrame to a file so that I can read it back from that file later. My requirements:

  • the file format should be decently portable (good library support on Windows/Linux in major languages)

  • the DataFrame I read should be absolutely identical to the one I saved

According to this post, `read_csv` and `to_csv` may work if I pass the `index_col=0` argument, but the data types are lost. Automatic type inference is not guaranteed to give me back the same types even for simple types, and Python objects such as lists are never inferred at all.

Is there some simple solution that just works for sure, without having to worry about many edge cases?

The only solution I can think of is using `to_csv` / `read_csv` and saving the type information somewhere else. Still, I'm afraid there might be more hidden problems (like duplicate column names, etc.).
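The "CSV plus type information stored elsewhere" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the helper names `save_csv_with_dtypes` and `load_csv_with_dtypes` are hypothetical, the dtypes are kept as a JSON mapping next to the CSV, and only simple column types (integers, floats, strings) are covered. Lists, categoricals, timezone-aware datetimes, and duplicate column names would all need extra handling, which is exactly the kind of edge case the question worries about.

```python
import io
import json

import pandas as pd


def save_csv_with_dtypes(df, csv_buf, meta_buf):
    """Write the frame as CSV and its column dtypes as JSON (hypothetical helper)."""
    df.to_csv(csv_buf)
    # Assumes unique column names; duplicates would collide in this dict.
    json.dump({col: str(dtype) for col, dtype in df.dtypes.items()}, meta_buf)


def load_csv_with_dtypes(csv_buf, meta_buf):
    """Read the CSV back, forcing the dtypes recorded at save time."""
    dtypes = json.load(meta_buf)
    return pd.read_csv(csv_buf, index_col=0, dtype=dtypes)


df = pd.DataFrame(
    {"code": ["007", "042"], "n": [1, 2], "x": [1.0, 2.0]},
    index=["a", "b"],
)

csv_buf, meta_buf = io.StringIO(), io.StringIO()
save_csv_with_dtypes(df, csv_buf, meta_buf)

# Plain read_csv: type inference turns the string column "code" into integers.
csv_buf.seek(0)
naive = pd.read_csv(csv_buf, index_col=0)

# Read with the saved dtypes: "code" stays a string column.
csv_buf.seek(0)
meta_buf.seek(0)
restored = load_csv_with_dtypes(csv_buf, meta_buf)

print(naive.dtypes)
print(restored.dtypes)
```

In-memory buffers are used here only to keep the sketch self-contained; real file paths work the same way with `to_csv` and `read_csv`.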

max
  • @tzaman I guess it's related, but that question is focused on speed, and the top/accepted answer is completely inappropriate in my case since I'm looking for portability. (pickle files can't be read outside of python, not easily). – max Aug 12 '16 at 22:09
  • That same answer also mentions `hdf5`. Does that not satisfy? – piRSquared Aug 12 '16 at 22:21
  • @piRSquared Yup just checked and it works. (Apart from same-name columns which are not allowed, but it's ok.) I didn't see any guarantee in the docs that HDF5 read/write are invertible, but I guess it just happens to be.. – max Aug 12 '16 at 23:14
  • I use it regularly. It's very fast and portable. Only thing I can't verify is strong support from other languages. But I do see on wikipedia that it is supported widely. – piRSquared Aug 12 '16 at 23:15
  • @piRSquared yes, definitely perfect solution. – max Aug 12 '16 at 23:17
  • This is not the same question as the duplicate. It is higher quality. The other question asks for speed, and many answers (including the one answer to this one) aren't portable or were listed as experimental at the time the answer was written. In the interest of ensuring StackOverflow is a useful `key→value` store, we should probably re-open and properly answer this question. (Unless a *true* and *higher quality* duplicate that predates this question can be found). – MRule Feb 01 '23 at 17:22
  • I researched and tested most of the formats supported by pandas. I've prepared markdown for a fairly thorough answer, but won't be able to provide it as an answer as long as this remains locked. – MRule Feb 01 '23 at 18:44

1 Answer

-3

`pd.DataFrame.to_pickle` / `pd.read_pickle` preserve the columns' data types. Let's check it out:

import pandas as pd
df_in.to_pickle('input_5')
df_out = pd.read_pickle('input_5')  # read back from the same path that was written
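A self-contained way to run that check might look like the following sketch. The sample frame and its column names are made up for illustration, and a temporary directory stands in for the working directory. Note that, as pointed out in the comments on the question, pickle files are not easily read outside Python, so this only addresses the "identical DataFrame" requirement and not portability.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame: simple types, datetimes, and a column of Python lists.
df_in = pd.DataFrame(
    {
        "ints": [1, 2, 3],
        "floats": [0.5, 1.5, 2.5],
        "when": pd.to_datetime(["2016-08-12", "2016-08-13", "2016-08-14"]),
        "tags": [[1], [2, 3], []],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_5")
    df_in.to_pickle(path)
    df_out = pd.read_pickle(path)

# Column dtypes survive the round trip, including the object column of lists.
assert list(df_out.dtypes) == list(df_in.dtypes)
assert df_out["tags"].tolist() == df_in["tags"].tolist()

# The remaining columns compare equal value by value, index included.
pd.testing.assert_frame_equal(
    df_out.drop(columns="tags"), df_in.drop(columns="tags")
)
```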
ragesz