
I have a CSV file such as this one (example.csv):

STRING_COL,INT_1,INT_2,FLOAT,INT_3
Hello,9,65151651,3234.54848,7832
This is a string,2,5484651,34.234,-999
Another,2,62189548,51.51658,-999
Test,2,2131514,5.2156,-999
Ham,9,6546548,2.15,-999
String,9,3216546,2.15468,-999

Every cell has a different number of decimals. The values can be strings as well, or integers (Int64, Int8, ...). I also have a similar CSV, but with some values changed, and I want to check the differences between both files.

Therefore I have written code similar to this to compare the values cell by cell:

import pandas as pd

# read everything as strings first, then convert the numeric columns,
# downcasting integers where possible
df = pd.read_csv(
    'example.csv', delimiter=',', comment='#', skip_blank_lines=True,
    verbose=False, engine='python', dtype=str
)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))

df_2 = pd.read_csv(
    'example_2.csv', delimiter=',', comment='#', skip_blank_lines=True,  # file with small changes
    verbose=False, engine='python', dtype=str
)
df_2 = df_2.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))

# compare the two DataFrames cell by cell
for i in list(df.index):
    for column in list(df.columns):
        old = df.loc[i, column]
        new = df_2.loc[i, column]
        if old != new:
            print('DIFFERENT VALUE >> INDEX: {} | OLD: {} | NEW: {}'.format(i, old, new))

If you run this example with the small CSV file I am pretty sure it is going to work well. But with a huge CSV file some weird things happen. I do not understand why some values end up represented like this:

1.6440000000000001  >> original value 1.644
7.7189999999999985  >> original value 7.7189

Then when I compare them the script reports that they are different, which is not true because the value is the same. What is happening? Is there a way to fix this? Is there a better way to compare values with DataFrames?
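For example, comparing the two representations above directly (both come from the same 1.644 in the files) already shows the mismatch:

old = 1.644
new = 1.6440000000000001
print(old == new)   # False, even though both stand for the same value in the CSV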

NOTE: Maybe I am doing something wrong in another part of my original code, but I think I have included the most important and relevant parts.

NOTE 2: I take into account that the != operator does not work well with NaN values. I use np.isnan to check those cases.
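This is the NaN behaviour I mean:

import numpy as np

print(np.nan != np.nan)   # True, so a plain != reports unchanged NaN cells as differences
print(np.isnan(np.nan))   # True, so I check NaNs explicitly with np.isnan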

Update: I do not need to compare and say "yes, it is equal" or "no, it is not equal". I need to retrieve the values that have changed, cell by cell.

    You are running into floating point issues. It is never a good idea to compare floating point values with `==` or `!=`. A better solution would be to define an absolute or relative epsilon (choose the one appropriate for your problem) and do `abs(a - b) < epsilon` for the comparison. See this, e.g. https://stackoverflow.com/questions/5595425/what-is-the-best-way-to-compare-floats-for-almost-equality-in-python – sobek Sep 22 '18 at 21:00
  • I do not need to confirm equality. I need to retrieve the values that have changed cell by cell. I have updated the question to clarify it. Thanks @sobek for pointing me to the `epsilon` value, I am reading about it – ChesuCR Sep 22 '18 at 21:20
  • @sobek Anyway, I found a real duplicate. I have answered my own question. Take a look if you want to add anything. Thanks – ChesuCR Sep 22 '18 at 23:36

1 Answer


Finally I found something to make the comparison properly: np.isclose(). I have read the duplicate question I found and some other questions about the epsilon value: numpy.finfo(), epsilon.

Epsilon: Numbers which differ by less than machine epsilon are numerically the same

    abs(a - b) < epsilon
    absolute(a - b) <= (atol + rtol * absolute(b))      # np.isclose() method
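As a quick check with the two representations from my question (assuming both sides are plain float64 values):

import numpy as np

eps64 = np.finfo(np.float64).eps
print(1.644 == 1.6440000000000001)                                   # False
print(np.isclose(1.644, 1.6440000000000001, atol=0.0, rtol=eps64))   # True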

So I need to do something like this, restricted to the numeric columns (np.isclose cannot handle the string ones). I still have to check what happens when comparing float32 against float64 or float16:

import numpy as np

eps64 = np.finfo(np.float64).eps
for col in df.select_dtypes(include='number').columns:  # np.isclose only works on numeric columns
    close = np.isclose(
        df[col],
        df_2[col],
        equal_nan=False,
        atol=0.0,
        rtol=eps64
    )
    for i in df.index[~close]:  # keep only the cells that really changed
        print('DIFFERENT VALUE >> INDEX: {} | COLUMN: {} | OLD: {} | NEW: {}'.format(
            i, col, df.loc[i, col], df_2.loc[i, col]))

But now I am facing the problem that if I want to copy the value to another variable I copy the inaccurate value 1.6440000000000001. What I am doing now to work around this is to cast the value to float >> float(1.6440000000000001)
