Pandas to_csv export giving wrong values in a dataframe

Question

I am using pandas and have imported two csv.

df1 is

df2 is

The data type of df2 is

When i am doing some manipulation on df1 and df2 :

df3= pd.merge(df1, df2, how='left', on=['Origin City Code', 'DC'])

and then export it to csv

df3.to_csv("test.CSV")

then the sum of all the values under column "Volume" is NOT matching with sum of the values under columns of original df2. In-fact the sum in df3 is coming out to be more. I believe the issue is coming up due to floating point numbers. But is there any way to resolve it ?? I have gone through the following links but my question remains unanswered.

https://github.com/pydata/pandas/issues/2069

float64 with pandas to_csv

reading and writing csv in pandas changes cell values

Wrong decimal calculations with pandas

Here is the code files i am using:https://www.dropbox.com/s/kjpnhl7qtojes92/sample.rar?dl=0

Do you have repeat `['Origin City Code', 'DC']` rows in `df1`? That would cause there to be repeat volumes in the merged DataFrame, explaining the larger sum. — root, Sep 23 '16 at 17:49
I don't mean repeats as a whole, just on the `['Origin City Code', 'DC']` subset, i.e. `df.duplicated(subset=['Origin City Code', 'DC']).any()`. How much is the sum off by? Can you provide a reproducible example? I can't reproduce the error with the data you've shown. — root, Sep 23 '16 at 17:55
Can you show the values of the sums in question? How *much* more is the sum for `df3` than the sum for `df2`? — Mark Dickinson, Sep 23 '16 at 18:41
And what are `df2.shape` and `df3.shape`? Do the two dataframes have the same number of rows? — Mark Dickinson, Sep 23 '16 at 18:48
@Mark: df2.shape=(66179,7) and df3.shape = (331810, 5).. df2 has sum = 733775 and df 3 = 736557 — Ajeet, Sep 23 '16 at 18:53
So `df3` has many more rows than `df2`. Why do you expect the volume sums to be equal? BTW, that size difference in the results is very unlikely to be the result of floating-point error. — Mark Dickinson, Sep 23 '16 at 18:55
@MarkDickinson and root.. i have shared with you the sample files and codes ( Dropbox link updated in my original post). could you please take a look.. — Ajeet, Sep 23 '16 at 18:57

shawnheide · Accepted Answer · 2016-09-25T21:44:14.487

1

I looked at your files, as @root was saying above, in df1 the combination of Origin City Code and DC are not unique. For instance, there are two records with Origin City Code = GGN and DC = ASA.

If you want to check it out you can run the following code:

df1[df1.duplicated(subset=['Origin City Code', 'DC'], keep=False)].sort_values(['Origin City Code', 'DC'])

Here's the head of this output:

edited Sep 25 '16 at 21:44

answered Sep 23 '16 at 20:45

shawnheide

797
4
11

Nice answer. It's @root who was talking about non-uniqueness, though, not me. (Though I agree with root that it seems likely that's at least part of the issue.) – Mark Dickinson Sep 24 '16 at 08:04

Pandas to_csv export giving wrong values in a dataframe

1 Answers1