Speed up search for nearest upper and lower value in large pandas dataframe

Question

My dataframe looks similar to this example below (just with way more entries). I want to obtain the nearest upper and lower number for a given value, for each group.

For example for a value of 13. I would like to obtain a new dataframe similar to:

I already tried the solution from Ivo Merchiers in How do I find the closest values in a Pandas series to an input number? using groupby and apply to run it for the different groups.

def find_neighbours(value):
  exactmatch=df[df.num==value]
  if !exactmatch.empty:
      return exactmatch.index
  else:
      lowerneighbour_ind = df[df.num<value].num.idxmax()
      upperneighbour_ind = df[df.num>value].num.idxmin()
      return [lowerneighbour_ind, upperneighbour_ind]

df=df.groupby('a').apply(find_neighbours, 13)

But since my dataset has around 16 million lines this procedure takes extremely long. Is there possibly a faster way to obtain a solution?

Edit Thanks for your answers. I forgot to add some info. If a close number appears multiple times I would like to have all lines transfered to the new dataframe. And when there is only one upper (lower) and no lower (upper) neighbour, this lines should be ignored.

Leads for 13 to this:

Thanks for your help!

BENY · Accepted Answer · 2020-08-14T20:00:28.177

3

Yes we can speed it up

v=13

s=(df.b-v)
t=s.abs().groupby([df.a,np.sign(s)]).transform('min')
df1=df.loc[s.abs()==t]
df1=df1[df1.b.sub(v).groupby(df.a).transform('nunique')>1]
df1
Out[102]: 
      a   b
1   600  12
2   600  15
5   700  11
6   700  19
9   900  12
10  900  14
11  900  14

edited Aug 14 '20 at 20:00

answered Aug 14 '20 at 18:27

BENY

317,841
20
164
234

Thank you. This is clearly faster than my previous approach. Do you have an idea how to implement my Edit? – mettwurstolaf Aug 14 '20 at 19:12
is there a way to ignore the lines when there is only one of the nearest numbers given? Like the value 800 in column a in my example? There is no lower value, therefore I would also like to exclude the upper value. – mettwurstolaf Aug 14 '20 at 19:25
Works almost perfect. One minor problem. If you add line 7 (800, 14) a second time to the table. The code returns this lines although the lower value is missing. – mettwurstolaf Aug 14 '20 at 19:52
@mettwurstolaf ok , i think use nunique should be ok ` – BENY Aug 14 '20 at 20:00

score 1 · Answer 2 · answered Aug 14 '20 at 18:36

1

try this

def neighbours(x):
    d = (df.b-x)
    return df.loc[[d[d==d[d>0].min()].index[0], d[d==d[d<0].max()].index[0]]]
neighbours(13)

answered Aug 14 '20 at 18:36

Kuldip Chaudhari

1,112
4
8

Speed up search for nearest upper and lower value in large pandas dataframe

2 Answers2