2

My dataframe looks similar to this example below (just with way more entries). I want to obtain the nearest upper and lower number for a given value, for each group.

a    b  
600  10
600  12
600  15
600  17
700   8
700  11
700  19

For example for a value of 13. I would like to obtain a new dataframe similar to:

a    b  
600  12
600  15
700  11
700  19

I already tried the solution from Ivo Merchiers in How do I find the closest values in a Pandas series to an input number? using groupby and apply to run it for the different groups.

def find_neighbours(value):
  exactmatch=df[df.num==value]
  if !exactmatch.empty:
      return exactmatch.index
  else:
      lowerneighbour_ind = df[df.num<value].num.idxmax()
      upperneighbour_ind = df[df.num>value].num.idxmin()
      return [lowerneighbour_ind, upperneighbour_ind]

df=df.groupby('a').apply(find_neighbours, 13)

But since my dataset has around 16 million lines this procedure takes extremely long. Is there possibly a faster way to obtain a solution?

Edit Thanks for your answers. I forgot to add some info. If a close number appears multiple times I would like to have all lines transfered to the new dataframe. And when there is only one upper (lower) and no lower (upper) neighbour, this lines should be ignored.

a    b  
600  10
600  12
600  15
600  17
700   8
700  11
700  19
800  14
800  15
900  12
900  14
900  14

Leads for 13 to this:

a    b  
600  12
600  15
700  11
700  19
900  12
900  14
900  14

Thanks for your help!

2 Answers2

3

Yes we can speed it up

v=13

s=(df.b-v)
t=s.abs().groupby([df.a,np.sign(s)]).transform('min')
df1=df.loc[s.abs()==t]
df1=df1[df1.b.sub(v).groupby(df.a).transform('nunique')>1]
df1
Out[102]: 
      a   b
1   600  12
2   600  15
5   700  11
6   700  19
9   900  12
10  900  14
11  900  14
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thank you. This is clearly faster than my previous approach. Do you have an idea how to implement my Edit? – mettwurstolaf Aug 14 '20 at 19:12
  • is there a way to ignore the lines when there is only one of the nearest numbers given? Like the value 800 in column a in my example? There is no lower value, therefore I would also like to exclude the upper value. – mettwurstolaf Aug 14 '20 at 19:25
  • Works almost perfect. One minor problem. If you add line 7 (800, 14) a second time to the table. The code returns this lines although the lower value is missing. – mettwurstolaf Aug 14 '20 at 19:52
  • @mettwurstolaf ok , i think use nunique should be ok ` – BENY Aug 14 '20 at 20:00
1

try this

def neighbours(x):
    d = (df.b-x)
    return df.loc[[d[d==d[d>0].min()].index[0], d[d==d[d<0].max()].index[0]]]
neighbours(13)
Kuldip Chaudhari
  • 1,112
  • 4
  • 8