5

I want to create 2 new columns which would give me the closest value and ID to a certain value. This is how my df in python is structured:

x_time    expiration    x_price    p_time    p_price
 100          4          55.321     100        21
 105          4          51.120     105        25
 110          4          44.412     110        33.1
 100          5           9.1       100        3.1
 105          5           9.5       105        5.1
 110          5           8.2       110        12.1 
 100          6           122.1     100        155.9
 105          6           144.1     105        134.2 
 .......

Essentially, I want to create a new column (called 'closest_time' & 'closest_price') which would be the closest p_price to the x_price for that group only (hence the group by expiration)

So, the expected results would look like this:

x_time    expiration    x_price    p_time    p_price   closest_price closest_p_time
 100          4          55.321     100        21           33.1       110
 105          4          51.120     105        25           33.1       110
 110          4          44.412     110        33.1         33.1       110
 100          5           9.1       100        3.1          12.1       110
 105          5           9.5       105        5.1          12.1       110
 110          5           8.2       110        12.1          5.1       105
 100          6           122.1     100        155.9       134.2       105
 105          6           144.1     105        134.2       134.22      100

Hopefully, this somewhat makes sense.

I have thought of potential way to go about doing this:

  1. use for-loops..
    • first loop by expiration
    • then parse through p_price and compare all values to each x_price and select the closest one (min(abs())
    • however, it seems like the longest way of going about it - if there is a way to vectorize this, that would be ideal!

however, I have not been successful.

Thank you!

yungpadewon
  • 237
  • 3
  • 10
  • 1
    Rows are independent of each other generally. One reasoning is that if your datalist is incomplete (paging etc) you may not have an accurate list of information in which to filter against. That being said, given that information, you may find the answer you are looking for by defining ordered list of numbers and then getting the value at index +1, where index is the index of the value for that row. – Fallenreaper Apr 16 '19 at 19:14
  • isn't `144.1` closer to `134.2` in expiration 6 ? assuming it's a typo ? : ) – Umar.H Apr 16 '19 at 20:09
  • 1
    @Datanovice , correct. thank you for that , DataEXPERT! – yungpadewon Apr 16 '19 at 20:10
  • hope my attemp makes sense, i'm waiting for the experts to try there hand to see a better solution. – Umar.H Apr 16 '19 at 20:20
  • 1
    @Datanovice , I will try your attempt when im at my terminal. regardless, thank you v much for taking your time to help, greatly appreciated! will get back to you – yungpadewon Apr 16 '19 at 20:21
  • no problemo, p.s isn't the cloest `p-time` for group 4 100? (what was returned in my attempt) that's the only thing that confused me and possibly others. – Umar.H Apr 16 '19 at 20:45

3 Answers3

3

I think a nice solution is this:

df['closest_price'] = \
    df.apply(lambda x: df[df.p_price <= x.x_price]['p_price'].max(), axis=1)
tvgriek
  • 1,215
  • 9
  • 20
  • I will try this on my df in a few, but does this solution incorporate groupby? – yungpadewon Apr 16 '19 at 19:12
  • perhaps I can add df.groupby('expiration').apply(your func?) ?? @tvgriek – yungpadewon Apr 16 '19 at 19:12
  • if you would use groupby, you will have 3 groups/rows left (4, 5, 6). You can apply a custom function to these groups. Not sure what you want to achieve though. Need to see an example – tvgriek Apr 16 '19 at 19:20
  • Okay, let me work through your answer, and see if I can work towards what my end goal is. Thank you for your help so far @tvgriek – yungpadewon Apr 16 '19 at 19:21
2

I'm not entirely sure if your expected results is right - the last row seems off, 144.1 is closer to 134.2 than 155.9, so the closest_p_time should be 105?

I do this by first building a lookup of what row contains the closest value, then just selecting the price and time for those rows.

import io

d = """
x_time    expiration    x_price    p_time    p_price
 100          4          55.321     100        21
 105          4          51.120     105        25
 110          4          44.412     110        33.1
 100          5           9.1       100        3.1
 105          5           9.5       105        5.1
 110          5           8.2       110        12.1 
 100          6           122.1     100        155.9
 105          6           144.1     105        134.2 
"""

df = pd.read_csv(io.StringIO(d), delim_whitespace=True)

idx_lookup = df.apply(lambda x: (df.loc[df['expiration'] == x['expiration'], 'p_price'] - x['x_price']).abs().idxmin(), axis=1)

df['closest_p_price'] = df.loc[idx_lookup, 'p_price'].values
df['closest_p_time'] = df.loc[idx_lookup, 'p_time'].values

Which gives the dataframe

   x_time  expiration  x_price  p_time  p_price  closest_p_price  closest_p_time
0     100           4   55.321     100     21.0             33.1             110
1     105           4   51.120     105     25.0             33.1             110
2     110           4   44.412     110     33.1             33.1             110
3     100           5    9.100     100      3.1             12.1             110
4     105           5    9.500     105      5.1             12.1             110
5     110           5    8.200     110     12.1              5.1             105
6     100           6  122.100     100    155.9            134.2             105
7     105           6  144.100     105    134.2            134.2             105

Note this uses apply along axis=1, which essentially goes through each row. I have not yet thought of a way to do this any better.

Ken Syme
  • 3,532
  • 2
  • 17
  • 19
  • very cool, only difference is our closest p_time column, I had assumed that this was bench marked against the x_time so the p_time for expiration group 4 would be 100 for the first row. I may be wrong though. – Umar.H Apr 17 '19 at 15:58
  • Ken_Syme , thanks for that. I was able to understand exactly what was going on in this solution. Thank you very much! Also, thanks for the help @Datanovice – yungpadewon Apr 17 '19 at 21:04
  • @Datanovice Yeah they are a very similar idea, yours might be slightly faster on really big frames as doesn't use `apply`. I had assumed the closest_p_time was the p_time for the closest p_price, which is I think the main output difference between our solutions. – Ken Syme Apr 18 '19 at 08:49
1

Okay, my attempt. I'm using a self-defined function which I've taken from this post which turns a column into an array based of a value. I then use groupby and transform to run over your intended arrays to find your closest match.

def find_nearest(array, values):
array = np.asarray(array)

    # the last dim must be 1 to broadcast in (array - values) below.
    values = np.expand_dims(values, axis=-1) 

    indices = np.abs(array - values).argmin(axis=-1)
    return array[indices]

    ## lets use a lambda with transform to assign this back to our main df 
    df['closest_price'] = df.groupby('expiration')['x_price'].transform(lambda x : find_nearest(df['p_price'],x))
    df['closest_time'] = df.groupby('expiration')['x_time'].transform(lambda x : find_nearest(df['p_time'],x))


     print(df)
    x_time  expiration  x_price p_time  p_price closest_price   closest_time
0   100     4   55.321      100 21.0    33.1    100
1   105     4   51.120      105 25.0    33.1    105
2   110     4   44.412      110 33.1    33.1    110
3   100     5   9.100       100 3.1     12.1    100
4   105     5   9.500       105 5.1     12.1    105
5   110     5   8.200       110 12.1    5.1     110
6   100     6   122.100     100 155.9   134.2   100
7   105     6   144.100     105 134.2   134.2   105
Umar.H
  • 22,559
  • 7
  • 39
  • 74