Equal-looking Python data frames aren't equal

Question

I am following an online Python course, which is getting into data frames.

I downloaded this CSV file and imported it into a data frame:

import os
import pandas as pd
os.chdir('C:/cygwin64/home/User.Name/path/to/brics.csv')
pd.read_csv( os.getcwd() + '/brics.csv' )
myBrics = pd.read_csv( 'brics.csv' )
myBrics

      Unnamed: 0       country    capital    area  population
    0         BR        Brazil   Brasilia   8.516      200.40
    1         RU        Russia     Moscow  17.100      143.50
    2         IN         India  New Delhi   3.286     1252.00
    3         CH         China    Beijing   9.597     1357.00
    4         SA  South Africa   Pretoria   1.221       52.98

I then used the code given in the course presentation to create the same data frame:

dict = {
   "country":["Brazil", "Russia", "India", "China", "South Africa"],
   "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
   "area":[8.516, 17.10, 3.286, 9.597, 1.221],
   "population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
brics

            country    capital    area  population
    0        Brazil   Brasilia   8.516      200.40
    1        Russia     Moscow  17.100      143.50
    2         India  New Delhi   3.286     1252.00
    3         China    Beijing   9.597     1357.00
    4  South Africa   Pretoria   1.221       52.98

They appear to be identical except for the first column in myBrics. Some web searching showed that I can get rid of that column:

myBrics.drop( myBrics.columns[[0]] , axis=1 )

            country    capital    area  population
    0        Brazil   Brasilia   8.516      200.40
    1        Russia     Moscow  17.100      143.50
    2         India  New Delhi   3.286     1252.00
    3         China    Beijing   9.597     1357.00
    4  South Africa   Pretoria   1.221       52.98

However, the identical-looking data frames are still not equal:

myBrics.drop( myBrics.columns[[0]] , axis=1 ).equals( brics )

    False

Can anyone please explain what is going on? Thanks.

I am using Python 3.7 from Spyder, installed (by someone with administrator rights) via Anaconda. The OS is Windows 7 64-bit.

You can use `from pandas.util.testing import assert_frame_equal`, then `assert_frame_equal(myBrics, brics)` to get more information about what isn't the same. — m13op22, Aug 20 '19 at 19:38

score 2 · Accepted Answer · answered Aug 20 '19 at 19:43

You're relying on equality of floating-point values returning true; there are many resources out there that explain why that doesn't work as expected.

I'd recommend importing numpy and using its isclose function on the floating-point columns.

Add this to your imports:

import numpy as np

Then use the following:

eq = np.isclose(myBrics['area'], brics['area'])

If you want to go into more detail about what is going on with floats, see this answer.
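
To extend the np.isclose idea from one column to a whole data frame, something like the sketch below could be used. The helper name frames_close and the tolerances are assumptions made for illustration, not part of pandas or NumPy, and the two small frames stand in for the CSV-based and dict-based frames in the question.

```python
import numpy as np
import pandas as pd


def frames_close(a, b, rtol=1e-9, atol=1e-12):
    """Return True if two frames match, comparing float columns with a tolerance.

    Hypothetical helper for illustration; the tolerances are arbitrary choices.
    """
    # Column labels and row counts must agree before comparing values.
    if list(a.columns) != list(b.columns) or len(a) != len(b):
        return False
    for name in a.columns:
        both_float = (pd.api.types.is_float_dtype(a[name])
                      and pd.api.types.is_float_dtype(b[name]))
        if both_float:
            # Tolerant comparison for floating-point data.
            close = np.isclose(a[name].to_numpy(), b[name].to_numpy(),
                               rtol=rtol, atol=atol, equal_nan=True)
            if not close.all():
                return False
        elif not a[name].equals(b[name]):
            # Exact comparison for everything else (strings, integers, ...).
            return False
    return True


# 0.1 + 0.2 and 0.3 differ in the last bit, which mimics a tiny parsing difference.
left = pd.DataFrame({"country": ["India", "China"], "area": [0.1 + 0.2, 9.597]})
right = pd.DataFrame({"country": ["India", "China"], "area": [0.3, 9.597]})

print(left.equals(right))         # exact comparison
print(frames_close(left, right))  # tolerant comparison
```

Here the exact comparison prints False and the tolerant one prints True.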

score 1 · Answer 2 · answered Aug 20 '19 at 19:26

I suspect it's the dtype of your columns. As the docs mention:

The column headers do not need to have the same type, but the elements within the columns must be the same dtype.

You can use:

dataframe.dtypes

to see the datatype of each column.
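
As a small illustration of the dtype rule quoted above, the sketch below builds two frames (made up for this example) whose values compare equal element by element but whose column dtypes differ, so equals rejects them.

```python
import pandas as pd

# Same numbers, but one column holds integers and the other holds floats.
ints = pd.DataFrame({"population": [1252, 1357]})
floats = pd.DataFrame({"population": [1252.0, 1357.0]})

# Element-wise comparison ignores the dtype difference.
same_values = bool((ints["population"] == floats["population"]).all())

# Comparing the dtypes directly shows the mismatch.
same_dtypes = list(ints.dtypes) == list(floats.dtypes)

# DataFrame.equals requires matching dtypes as well as matching values.
frames_equal = ints.equals(floats)

print(ints.dtypes)
print(floats.dtypes)
print(same_values, same_dtypes, frames_equal)
```

If the dtypes of the two frames already match, as they turn out to in this question, the cause has to be in the values themselves.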

alex067

score 1 · Answer 3 · answered Aug 20 '19 at 19:52

Allan Elder's answer is correct. I ran this code:

import os
import pandas as pd
myBrics = pd.read_csv( 'brics.csv' )
dict = {
     "country":["Brazil", "Russia", "India", "China", "South Africa"],
     "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
     "area":[8.516, 17.10, 3.286, 9.597, 1.221],
     "population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
myBrics = myBrics.drop( myBrics.columns[[0]] , axis=1 )
print(myBrics['area'].equals(brics['area']))

The result was

False

user36800 · Answer 4 · 2019-08-20T20:35:45.277

Quantization error is responsible for the discrepancy. Here is the series of troubleshooting steps suggested by respondents:

import os
import pandas as pd
os.chdir('C:/cygwin64/home/User.Name/path/to/brics.csv')
pd.read_csv( os.getcwd() + '/brics.csv' )
myBrics = pd.read_csv( 'brics.csv' )
myBrics

     Unnamed: 0       country    capital    area  population
   0         BR        Brazil   Brasilia   8.516      200.40
   1         RU        Russia     Moscow  17.100      143.50
   2         IN         India  New Delhi   3.286     1252.00
   3         CH         China    Beijing   9.597     1357.00
   4         SA  South Africa   Pretoria   1.221       52.98

dict = {
 "country":["Brazil", "Russia", "India", "China", "South Africa"],
 "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
 "area":[8.516, 17.10, 3.286, 9.597, 1.221],
 "population":[200.4, 143.5, 1252, 1357, 52.98] }
brics = pd.DataFrame(dict)
brics

           country    capital    area  population
   0        Brazil   Brasilia   8.516      200.40
   1        Russia     Moscow  17.100      143.50
   2         India  New Delhi   3.286     1252.00
   3         China    Beijing   9.597     1357.00
   4  South Africa   Pretoria   1.221       52.98

alex067 suggested checking data types, which shows they are the same:

brics.dtypes

   Out[14]:
   country        object
   capital        object
   area          float64
   population    float64
   dtype: object

myBrics.dtypes

   Out[15]:
   Unnamed: 0     object
   country        object
   capital        object
   area          float64
   population    float64
   dtype: object

HS-nebula suggested using assert_frame_equal to see where the differences lie:

from pandas.util.testing import assert_frame_equal
assert_frame_equal(myBrics.drop( myBrics.columns[[0]] , axis=1 ), brics)
    # Reports no differences
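
A likely reason assert_frame_equal reports no differences is that, by default, it compares float columns approximately rather than bit for bit. The sketch below uses the public pandas.testing import path and a made-up pair of frames that differ only in the last bit; passing check_exact=True makes the same call report the mismatch.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# 0.1 + 0.2 and 0.3 differ only in the last bit of the float64 value.
left = pd.DataFrame({"area": [0.1 + 0.2]})
right = pd.DataFrame({"area": [0.3]})

# Default mode: float columns are compared approximately, so this passes.
assert_frame_equal(left, right)

# Exact mode: the last-bit difference is reported as an AssertionError.
try:
    assert_frame_equal(left, right, check_exact=True)
    exact_match = True
except AssertionError as err:
    exact_match = False
    print(err)
```

DataFrame.equals, by contrast, always compares exactly, which is why it returned False on the same data.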

Josh and Allan Elder said that the difference was due to quantization error:

import numpy as np
np.isclose(myBrics['area'], brics['area'])

   array([ True,  True,  True,  True,  True])

brics['area'] - myBrics['area']

   0    0.000000e+00
   1    0.000000e+00
   2    0.000000e+00
   3   -1.776357e-15
   4    2.220446e-16
   Name: area, dtype: float64
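
To put the size of those differences in context, they can be compared with the spacing between adjacent float64 values near each number (one unit in the last place, or ULP). The sketch below copies the two nonzero differences as printed above, so they are rounded to seven significant digits.

```python
import numpy as np

# Nonzero differences as printed in the output above (rounded values).
reported = {9.597: -1.776357e-15, 1.221: 2.220446e-16}

ratios = {}
for value, diff in reported.items():
    ulp = np.spacing(value)      # gap between `value` and the next larger float64
    ratios[value] = diff / ulp   # difference expressed in ULPs
    print(value, ulp, ratios[value])
```

Both ratios come out at roughly one ULP in magnitude, meaning the two frames hold adjacent floating-point values for those entries.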

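This means that pd.read_csv converts the text representation of numerical data to floats slightly differently than Python does when it parses the literals passed to dict and pd.DataFrame. The differences are about one unit in the last place, so the likelier source is the float parser used by read_csv rather than dict, which only stores the floats it is given; the float_precision argument of read_csv selects a different converter. I find this inconsistency to be somewhat disturbing, but c'est la vie.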
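
One way to test where the discrepancy comes from is to ask read_csv for its round-trip float converter via float_precision="round_trip" and compare again. In the sketch below, the CSV text is an assumed stand-in for brics.csv (without the unnamed first column) and is read from memory. Whether the default converter reproduces the last-bit differences depends on the pandas version, so only the round-trip case is checked here.

```python
import io

import pandas as pd

# Assumed stand-in for the contents of brics.csv, minus the unnamed first column.
csv_text = """country,area,population
Brazil,8.516,200.4
Russia,17.10,143.5
India,3.286,1252
China,9.597,1357
South Africa,1.221,52.98
"""

from_dict = pd.DataFrame({
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
})

# The round-trip converter parses each number the same way Python's float() does.
from_csv = pd.read_csv(io.StringIO(csv_text), float_precision="round_trip")

numeric = ["area", "population"]
print(from_csv[numeric].equals(from_dict[numeric]))
```

If this prints True while the default call on the same text gives False, the difference comes from the CSV float parser and not from dict or pd.DataFrame.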

Thank you all!
