0

I am trying to compare two data frames using datacompy package. And I am seeing something weird.

print(dfA)
          req_nbr          unit_cost_amt
0        24468868            1.36870
1        24468868            1.36870
2        24468868            1.64952
3        24468868            1.64952
4        24468868            0.83289
5        24468868            0.83289
6        24468868            0.83289
7        24468868            0.83289

Then I have another dataframe with same data and structure.

print(dfB)

          req_nbr          unit_cost_amt
0        24468868            1.36870
1        24468868            1.36870
2        24468868            1.64952
3        24468868            1.64952
4        24468868            0.83289
5        24468868            0.83289
6        24468868            0.83289
7        24468868            0.83289

Both dataframes are same and of same datatypes.

dfA['unit_cost_amt'].dtype
dtype('float64')

dfB['unit_cost_amt'].dtype
dtype('float64')

Now I m doing a compare using datacompy

        compare = datacompy.Compare(
                                dfA,
                                dfB,

                                # You can also specify a list of columns
                                join_columns = ['req_nbr'], 

                                # Optional, defaults to 0
                                abs_tol = 0,

                                # Optional, defaults to 0
                                rel_tol = 0, 

                                # Optional, defaults to 'df1'
                                df1_name = 'Old',

                                # Optional, defaults to 'df2'
                                df2_name = 'New' 
                                )
    print(compare.report())

And it shows differences...

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       Old        2     8
1       New        2     8

Column Summary
--------------

Number of columns in common: 2
Number of columns in Old but not in New: 0
Number of columns in New but not in Old: 0

Row Summary
-----------

Matched on: req_nbr
Any duplicates on match values: Yes
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 8
Number of rows in Old but not in New: 0
Number of rows in New but not in Old: 0

Number of rows with some compared columns unequal: 4
Number of rows with all compared columns equal: 4

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 4

Columns with Unequal Values or Types
------------------------------------

          Column Old dtype New dtype  # Unequal      Max Diff  # Null Diff
    0  unit_cost_amt   float64   float64          4  1.110223e-16            0

Sample Rows with Unequal Values
-------------------------------

          req_nbr             unit_cost_amt (Old)       unit_cost_amt (New)
6        24468868                  0.83289                  0.83289
7        24468868                  0.83289                  0.83289
4        24468868                  0.83289                  0.83289
5        24468868                  0.83289                  0.83289

Any idea what I am doing wrong here ? Its puzzling.

Johnson Francis
  • 249
  • 3
  • 17

1 Answers1

2

The max diff at 1e-16 indicates it’s a difference in the last mantissa bit, or some rounding / cancellation issue like that. This can happen depending on how the differences are computed (it really shouldn’t if the numbers are exactly the same).

You should set rel_tol or abs_tol to avoid such issues − that’s what these parameters are there for.

For example, datacompy.Compare(..., rel_tol=1e-10) means* that numbers a and b will be the considered the same if abs(a / b - 1) is smaller than 10^-10. This relative threshold is big enough to never happen by accident and small enough for most applications.

You can choose whatever threshold works for you. Since all your unit_cost_amt seem to have 5 digits, you could also use abs_tol=1e-6.

* It’s typically defined like that, but I didn’t actually read the datacompy docs

Cimbali
  • 11,012
  • 1
  • 39
  • 68