
I have a dataframe with data from two sources stacked (unsorted). Is there a way to identify rows that are unique (have no counterpart in the other data source), or where the difference in value between the two sources is above a tolerance level?

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'date': ['2017-02-22', '2019-05-07', '2019-05-07', '2018-01-01', '2020-03-10', '2020-03-10'],
        'identifier': ['A', 'A', 'A', 'A', 'A', 'A'],
        'value': [123, 456, 455, 678, 999.9876, 900.1234],
        'source': ['x', 'x', 'y', 'y', 'x', 'y']
        })

tolerance = {'A': 2.50}
key = ['date', 'identifier']
df.sample(frac=1)  # shuffle the rows to make it a little more difficult

         date identifier     value source
0  2017-02-22          A  123.0000      x  # NOK - unique
1  2019-05-07          A  456.0000      x  # OK - within tolerance
2  2019-05-07          A  455.0000      y  # OK - within tolerance
3  2018-01-01          A  678.0000      y  # NOK - unique
4  2020-03-10          A  999.9876      x  # NOK - outside tolerance
5  2020-03-10          A  900.1234      y  # NOK - outside tolerance

Expected output

df.iloc[[0,3,4,5]]    
    
         date identifier     value source
0  2017-02-22          A  123.0000      x
3  2018-01-01          A  678.0000      y
4  2020-03-10          A  999.9876      x
5  2020-03-10          A  900.1234      y

Implementation

df['new_value'] = np.where(df['source'] == 'y', -1 * df['value'], df['value'])
df.groupby(key).agg({'source': 'count', 'new_value': 'sum'})


Out[57]: 
                       source  new_value
date       identifier                   
2017-02-22 A                1   123.0000
2018-01-01 A                1  -678.0000
2019-05-07 A                2     1.0000
2020-03-10 A                2    99.8642

I do not know how to proceed further.
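For reference, here is one sketch that continues the aggregation above. It uses `groupby(...).transform` so the per-key flags align with the original rows; the names `signed`, `grp`, `count`, `diff` are illustrative, and it assumes at most one row per source per key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2017-02-22', '2019-05-07', '2019-05-07',
             '2018-01-01', '2020-03-10', '2020-03-10'],
    'identifier': ['A'] * 6,
    'value': [123, 456, 455, 678, 999.9876, 900.1234],
    'source': ['x', 'x', 'y', 'y', 'x', 'y'],
})
tolerance = {'A': 2.50}
key = ['date', 'identifier']

# Sign the values so that a matched x/y pair sums to (roughly) zero.
signed = np.where(df['source'] == 'y', -df['value'], df['value'])
grp = df.assign(signed=signed).groupby(key)['signed']

count = grp.transform('size')      # rows sharing this key
diff = grp.transform('sum').abs()  # |value_x - value_y| when both present
tol = df['identifier'].map(tolerance)

# Keep rows that are unique to one source or whose pair differs too much.
result = df[(count == 1) | (diff > tol)]
```

With the sample data this selects the original rows 0, 3, 4 and 5.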

aeiou

1 Answer


You can use a self `merge_asof`:

df2 = df.sort_values(by='value')
out = pd.merge_asof(df2, df2.reset_index(),
                    by=['date', 'identifier'], on='value',
                    allow_exact_matches=False, direction='nearest',
                    tolerance=2.5).dropna(subset=['index'])  # pass subset as a list for older pandas

new_df = df.drop(out['index'])

output:

         date identifier     value source
0  2017-02-22          A  123.0000      x
3  2018-01-01          A  678.0000      y
4  2020-03-10          A  999.9876      x
5  2020-03-10          A  900.1234      y
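As a cross-check, here is an alternative sketch using a plain outer merge of the two sources with `indicator=True` (the `value_x`/`value_y` column names come from the default merge suffixes; like the answer above, this assumes at most one row per source per key):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2017-02-22', '2019-05-07', '2019-05-07',
             '2018-01-01', '2020-03-10', '2020-03-10'],
    'identifier': ['A'] * 6,
    'value': [123, 456, 455, 678, 999.9876, 900.1234],
    'source': ['x', 'x', 'y', 'y', 'x', 'y'],
})
tolerance = {'A': 2.50}
key = ['date', 'identifier']

x = df[df['source'] == 'x']
y = df[df['source'] == 'y']
m = x.merge(y, on=key, how='outer', indicator=True)

# A key is "bad" if it exists in only one source, or if the pair's gap
# exceeds the per-identifier tolerance.
gap = (m['value_x'] - m['value_y']).abs()
bad = (m['_merge'] != 'both') | (gap > m['identifier'].map(tolerance))
bad_keys = m.loc[bad, key]

result = df.merge(bad_keys, on=key)
```

The merge back against `df` recovers the offending rows from both sources, here the same four rows as in the output above (note that `df.merge` resets the index).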
mozway
  • Thanks. Do you sort intentionally by value? Wondering if `sort_values(by=key)` could speed up `merge_asof`. I seem to be also getting `TypeError: Index(...) must be called with a collection of some kind, 'index' was passed`. – aeiou Oct 10 '22 at 11:13
  • @aeiou yes, `merge_asof` requires that data is sorted by the merging key. Which line gives you the error? Which pandas version? – mozway Oct 10 '22 at 11:17
  • Version 1.3.5. It is caused by `.dropna(subset='index')`. – aeiou Oct 10 '22 at 11:19
  • 1
    Try `.dropna(subset=['index'])` – mozway Oct 10 '22 at 11:20