1

I have a large (5000+ rows) CSV file of transactions that we know contains some errors.

It has the following fields:

date         description      money_in     money_out   balance

01-01-2017   stringvalue        349            0        1000
02-01-2017   stringvalue         0           100         900
03-01-2017   stringvalue        10             0         890

To check which rows contains faulty data I've added the following code:

df['difference'] = df['money In'] - df['money Out']
df['BalanceDif'] = df['balance'] - df['balance'].shift()
df['RowCorrect'] = df['BalanceDif'].equals(df['difference'])

This gives the the following (somewhat puzzling) output (first columns left out):

Balance    difference  BalanceDif  RowCorrect  
682.36        30         30          False
758.36        76         76          False
708.36       -50        -50          False
707.57       -0.79       -0.79       False
712.57        5          5           False
762.57        50         50          False

Does anyone know what I am doing wrong, and why the 'df.RowCorrect' is returning the wrong value?

Jasper
  • 2,131
  • 6
  • 29
  • 61

2 Answers2

3

Notice that equals() looks at the equality of two dataframes or two series, so that:

df.difference.equals(df.BalanceDif)
# True

To achieve what you want, you can just do an element-wise comparison thus:

df['RowCorrect'] = df.difference == df.BalanceDif
df

    BalanceDif  difference  RowCorrect
0   30.00       30.00       True
1   76.00       76.00       True
2   -50.00      -50.00      True
3   -0.79       -0.79       True
4   5.00        5.00        True
5   50.00       50.00       True
Oriol Mirosa
  • 2,756
  • 1
  • 13
  • 15
2

You should avoid comparing the equality of floats. Check this answer for more details. Convert the relevant columns to Decimals with 2 float points. Your code should work fine afterwards.

Alternatively, you could convert the money_in and money_out columns to integers (in cents). I.e., df['money_in'] = df['money_in'].apply(lambda x: int(100*x)) df['money_out'] = df['money_in'].apply(lambda x: int(100*x))

before running the rest of your code.

  • Numpy has its own version: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.isclose.html – roganjosh Jan 01 '18 at 23:06
  • 1
    It turned out that one of the values was a 8 decimal float, whereas the other was a 2 decimal one. Thanks for the tip! – Jasper Jan 01 '18 at 23:10