
It seems strange to me that pandas.read_csv is not a direct reciprocal of DataFrame.to_csv. In the illustration below, with all default settings, the final DataFrame differs from the original by an extra "Unnamed: 0" column.

In [1]: import pandas as pd

In [2]: orig_df = pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); orig_df
Out[2]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

[4 rows x 3 columns]

In [3]: orig_df.to_csv('test.csv')

In [4]: final_df = pd.read_csv('test.csv'); final_df
Out[4]: 
   Unnamed: 0  AAA  BBB  CCC
0           0    4   10  100
1           1    5   20   50
2           2    6   30  -30
3           3    7   40  -50

[4 rows x 4 columns]

It seems the default read_csv should instead be

In [6]: final2_df = pd.read_csv('test.csv', index_col=0); final2_df
Out[6]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

[4 rows x 3 columns]

or the default to_csv should instead be

In [8]: orig_df.to_csv('test2.csv', index=False)

which when read gives

In [9]: pd.read_csv('test2.csv')
Out[9]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

[4 rows x 3 columns]
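The three pairings above can also be compared without writing any files. This is only a minimal sketch, assuming a reasonably recent pandas: `round_trip` is a hypothetical helper name introduced for illustration, and an in-memory `io.StringIO` buffer stands in for `test.csv`.

```python
import io

import pandas as pd

orig_df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                        'BBB': [10, 20, 30, 40],
                        'CCC': [100, 50, -30, -50]})


def round_trip(df, to_csv_kwargs=None, read_csv_kwargs=None):
    """Write df to an in-memory CSV and read it straight back (hypothetical helper)."""
    buf = io.StringIO()
    df.to_csv(buf, **(to_csv_kwargs or {}))
    buf.seek(0)  # rewind so read_csv starts at the beginning of the buffer
    return pd.read_csv(buf, **(read_csv_kwargs or {}))


# Defaults on both sides: the written index comes back as a data column.
default_rt = round_trip(orig_df)
# Tell read_csv that the first column is the index.
index_col_rt = round_trip(orig_df, read_csv_kwargs={'index_col': 0})
# Or do not write the index in the first place.
no_index_rt = round_trip(orig_df, to_csv_kwargs={'index': False})

print(list(default_rt.columns))      # ['Unnamed: 0', 'AAA', 'BBB', 'CCC']
print(orig_df.equals(index_col_rt))  # True
print(orig_df.equals(no_index_rt))   # True
```

Note that the `index=False` variant only round-trips here because the original index is the default 0..n-1 range; a meaningful index would be lost.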

(Perhaps this should be sent to the developers instead, but I am genuinely interested in why this is the default behavior. Hopefully it can also help someone else avoid the confusion I had.)

Steven C. Howell
  • I think it's because previously, when you used `pd.DataFrame.from_csv`, the default was indeed `index_col=0`, but this caused all kinds of havoc because CSVs come in all kinds of weird formats, so `read_csv` behaves differently. It's a good point and something worth posting as an improvement on [github](https://github.com/pydata/pandas/issues) – EdChum Jul 24 '15 at 22:31
  • That said, the real reciprocal is [`from_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html#pandas.DataFrame.from_csv), but it's no longer updated, in favour of the general `read_table` and `read_csv`, which have more flexibility – EdChum Jul 24 '15 at 22:45
  • I learned this the hard way with read_excel, since there is no round trip; for example, if you save a multi-indexed Excel file you'll have a hard time getting it back into a DataFrame – Skorpeo Jul 25 '15 at 03:11
  • It's often not clear to me either whether something goes here or at GitHub (or both), but I think this one definitely has a place at SO because you'll get a wider audience, and I agree it's good to inform people about default behavior like this (and how to work around it when needed). – JohnE Jul 25 '15 at 16:10

2 Answers


Thanks for the tip to post to the GitHub page, @EdChum. This led me to the pandas.DataFrame.from_csv function, which is indeed the reciprocal of pandas.DataFrame.to_csv.

In [6]: final_df = pd.DataFrame.from_csv('test.csv')

In [7]: final_df
Out[7]: 
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50

[4 rows x 3 columns]
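For pandas versions where `from_csv` is not available, roughly the same behavior can be had from `read_csv`. The sketch below assumes the defaults that the `from_csv` documentation listed as its differences from `read_csv` (`index_col=0` and `parse_dates=True`); `from_csv_like` is a hypothetical name, not a pandas function, and the date-indexed frame is made up for illustration.

```python
import io

import pandas as pd


def from_csv_like(path_or_buf, **kwargs):
    """Hypothetical stand-in for DataFrame.from_csv, built on read_csv.

    Assumes the two documented from_csv defaults: the first column is the
    index, and the index is parsed as dates where possible.
    """
    kwargs.setdefault('index_col', 0)
    kwargs.setdefault('parse_dates', True)
    return pd.read_csv(path_or_buf, **kwargs)


# Illustrative frame with a date index, so that parse_dates has an effect.
dates = pd.date_range('2015-07-24', periods=3, freq='D')
df = pd.DataFrame({'AAA': [4, 5, 6]}, index=dates)

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)  # rewind before reading back

restored = from_csv_like(buf)
print(type(restored.index).__name__)
print(restored)
```

With a plain integer index, as in `test.csv` above, `parse_dates=True` should leave the index as integers, so `pd.read_csv('test.csv', index_col=0)` alone is enough.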
Steven C. Howell

As mentioned above, pd.DataFrame.from_csv is no longer supported. The reciprocal of to_csv is now: pd.read_csv(file_name, index_col=0).

For example:

import pandas as pd

df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})

file_name = "df.csv"
# to_csv returns None when given a path, so there is nothing to assign
df.to_csv(file_name)

# index_col=0 restores the written index instead of adding "Unnamed: 0"
reconstructed_df = pd.read_csv(file_name, index_col=0)

print(reconstructed_df)

# will print
#         name    mask    weapon
# 0    Raphael     red       sai
# 1  Donatello  purple  bo staff
Shir