0

I am using pandas and have imported two csv.

df1 is

enter image description here

df2 is

enter image description here

The data type of df2 is

enter image description here

When i am doing some manipulation on df1 and df2 :

df3= pd.merge(df1, df2, how='left', on=['Origin City Code', 'DC'])

and then export it to csv

df3.to_csv("test.CSV")

then the sum of all the values under column "Volume" is NOT matching with sum of the values under columns of original df2. In-fact the sum in df3 is coming out to be more. I believe the issue is coming up due to floating point numbers. But is there any way to resolve it ?? I have gone through the following links but my question remains unanswered.

https://github.com/pydata/pandas/issues/2069

float64 with pandas to_csv

reading and writing csv in pandas changes cell values

Wrong decimal calculations with pandas

Here is the code files i am using:https://www.dropbox.com/s/kjpnhl7qtojes92/sample.rar?dl=0

Community
  • 1
  • 1
Ajeet
  • 57
  • 8
  • 3
    Do you have repeat `['Origin City Code', 'DC']` rows in `df1`? That would cause there to be repeat volumes in the merged DataFrame, explaining the larger sum. – root Sep 23 '16 at 17:49
  • @root .. no repeated value in df1. I checked – Ajeet Sep 23 '16 at 17:52
  • 2
    I don't mean repeats as a whole, just on the `['Origin City Code', 'DC']` subset, i.e. `df.duplicated(subset=['Origin City Code', 'DC']).any()`. How much is the sum off by? Can you provide a reproducible example? I can't reproduce the error with the data you've shown. – root Sep 23 '16 at 17:55
  • Can you show the values of the sums in question? How *much* more is the sum for `df3` than the sum for `df2`? – Mark Dickinson Sep 23 '16 at 18:41
  • And what are `df2.shape` and `df3.shape`? Do the two dataframes have the same number of rows? – Mark Dickinson Sep 23 '16 at 18:48
  • @Mark: df2.shape=(66179,7) and df3.shape = (331810, 5).. df2 has sum = 733775 and df 3 = 736557 – Ajeet Sep 23 '16 at 18:53
  • So `df3` has many more rows than `df2`. Why do you expect the volume sums to be equal? BTW, that size difference in the results is very unlikely to be the result of floating-point error. – Mark Dickinson Sep 23 '16 at 18:55
  • @MarkDickinson and root.. i have shared with you the sample files and codes ( Dropbox link updated in my original post). could you please take a look.. – Ajeet Sep 23 '16 at 18:57

1 Answers1

1

I looked at your files, as @root was saying above, in df1 the combination of Origin City Code and DC are not unique. For instance, there are two records with Origin City Code = GGN and DC = ASA.

If you want to check it out you can run the following code:

df1[df1.duplicated(subset=['Origin City Code', 'DC'], keep=False)].sort_values(['Origin City Code', 'DC'])

Here's the head of this output:

enter image description here

shawnheide
  • 797
  • 4
  • 11
  • Nice answer. It's @root who was talking about non-uniqueness, though, not me. (Though I agree with root that it seems likely that's at least part of the issue.) – Mark Dickinson Sep 24 '16 at 08:04