I have a CSV file such as this one (example.csv
).
STRING_COL,INT_1,INT_2,FLOAT,INT_3
Hello,9,65151651,3234.54848,7832
This is a string,2,5484651,34.234,-999
Another,2,62189548,51.51658,-999
Test,2,2131514,5.2156,-999
Ham,9,6546548,2.15,-999
String,9,3216546,2.15468,-999
Every cell has a different number of decimals. They can be strings as well or integers (Int64, Int8, ...). Then I also have a similar CSV, but with some values changed. I want to check de differences between both files.
Therefore I have written a code similar to this one to compare the values cell by cell:
import pandas as pd
df = pd.read_csv(
'example.csv', delimiter=',', comment='#', skip_blank_lines=True,
verbose=False, engine='python', dtype=str
)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))
df_2 = pd.read_csv(
'example_2.csv', delimiter=',', comment='#', skip_blank_lines=True, # file with small changes
verbose=False, engine='python', dtype=str
)
df_2 = df_2.apply(lambda x: pd.to_numeric(x, errors='ignore', downcast='integer'))
for i in list(df.index):
for column in list(df.columns):
old = df.loc[i, column]
new = df_2.loc[i, column]
if old != new:
print('DIFFERENT VALUE >> INDEX: {} | OLD: {} | NEW: {}'.format(i, old, new))
If you run this example with the small CSV file I am pretty sure it is going to work well. But with a huge CSV file some weird things are happening. I do not understand why sometimes many values are truncated to this ones:
1.6440000000000001 >> original value 1.644
7.7189999999999985 >> original value 7.7189
Then if I compare them it finds that they are different and this is not true because the value is the same. What is happening? Is there a way to fix this? Is there a better way to compare values with DataFrames?
NOTE: Maybe I am doing anything wrong in other part of my original code, but I think I have written the most important and relevant.
NOTE 2: I take into account that the !=
operator does not work well with NaN
values. I use np.isnan
to check this changes.
Update. I do not need to compare and say "yes, it is equal" and "no, it is not equal". I need to retrieve the values that have changes cell by cell.