3

I have a dataframe that looks like this:

index value
0     1
1     1
2     2
3     3
4     2
5     1
6     1

what I want is for each value to return the index of the previous smaller value, and, in addition, the index of the previous "1" value. If the value is 1 I don't need them (both values can be -1 or something).

So what I'm after is:

index value  previous_smaller_index  previous_1_index
0     1            -1                      -1
1     1            -1                      -1
2     2             1                       1
3     3             2                       1
4     2             1                       1
5     1            -1                      -1
6     1            -1                      -1

I tried using rolling, cumulative functions etc. but I couldn't figure it out. Any help would be appreciated!

Edit: SpghttCd already provided a nice solution for the "previous 1" problem. I'm looking for a nice pandas one liner for the "previous small" problem. (even though, of course, more nice and efficient solutions are welcomed for both problems)

cs95
  • 379,657
  • 97
  • 704
  • 746
Binyamin Even
  • 3,318
  • 1
  • 18
  • 45

4 Answers4

6
  • "previous_smaller_index" can be found using vectorised numpy broadcasted comparison with argmax.

  • "previous_1_index" can be solved using groupby and idxmax on a cumsummed mask.

m = df.value.eq(1)
u = np.triu(df.value.values < df.value[:,None]).argmax(1)
v = m.cumsum()

df['previous_smaller_index'] = np.where(m, -1, len(df) - u - 1)
df['previous_1_index'] = v.groupby(v).transform('idxmax').mask(m, -1)

df
   index  value  previous_smaller_index  previous_1_index
0      0      1                      -1                -1
1      1      1                      -1                -1
2      2      2                       1                 1
3      3      3                       2                 1
4      4      2                       1                 1
5      5      1                      -1                -1
6      6      1                      -1                -1

If you want these as one liners, you can scrunch a few lines together into one:

m = df.value.eq(1)
df['previous_smaller_index'] = np.where(
    m, -1, len(df) - np.triu(df.value.values < df.value[:,None]).argmax(1) - 1
)[::-1]

# Optimizing @SpghttCd's `previous_1_index` calculation a bit
df['previous_1_index'] = (np.where(
    m, -1, df.index.where(m).to_series(index=df.index).ffill(downcast='infer'))
)

df

   index  value  previous_1_index  previous_smaller_index
0      0      1                -1                      -1
1      1      1                -1                      -1
2      2      2                 1                       1
3      3      3                 1                       2
4      4      2                 1                       1
5      5      1                -1                      -1
6      6      1                -1                      -1

Overall Performance

Setup and performance benchmarking was done using perfplot. The code can be found at this gist.

enter image description here

Timings are relative (the y-scale is logarithmic).


previous_1_index Performance

Gist with relevant code.

enter image description here

cs95
  • 379,657
  • 97
  • 704
  • 746
  • It is a nice solution, and I upvoted and accepted, but I was hoping for a relatively short one liner with pure pandas functions. So I'll wait with the bounty. – Binyamin Even Jan 17 '19 at 08:01
  • @BinyaminEven Just wanted to point out I have added 1-liner versions for my solutions. – cs95 Jan 17 '19 at 18:06
  • thank you! it's an interesting solution, I need to break it apart to fully understand it. it will be interesting to compare efficiency with SpghttCd's solution. his code, especially for prev_1 is much shorter. – Binyamin Even Jan 17 '19 at 19:43
  • @BinyaminEven Keep in mind that shortness != performance. See [this answer](https://stackoverflow.com/a/50518852/4909087) where a 25 line numpy function outperforms pandas one liner. – cs95 Jan 17 '19 at 19:45
  • I didn't say his code is more efficient, actually for prev_small he uses apply and you don't. But I have a feeling that his code for prev_1 will run faster. it needs to be "timeit-ed". – Binyamin Even Jan 17 '19 at 19:48
  • that's awesome! I didn't know perfplot... can you compare only the prev_1 performance? It's pretty clear your solution as a whole is more efficient, but it will be interesting to see a comparison for prev_1. – Binyamin Even Jan 17 '19 at 20:11
  • @BinyaminEven I have added timings with an updated solution. – cs95 Jan 17 '19 at 20:38
  • 1
    ok thanks! because I promised to him, I will accept his answer, but I will happily give you the bounty (I can give it in 8 hours or so). – Binyamin Even Jan 17 '19 at 20:54
  • @W-B Sure, give me a little time. I'll have it up soon. – cs95 Jan 20 '19 at 02:28
2

You could try

df = pd.DataFrame({'value': [1,  1,  2,  3,  2,  1,  1, 2, 3, 4, 5]})

df['prev_smaller_idx'] = df.apply(lambda x: df.index[:x.name][(x.value>df.value)[:x.name]].max(), axis=1)

df['prev_1_idx'] = pd.Series(df.index.where(df.value==1)).shift()[df.value!=1].ffill()

#    value  prev_smaller_idx  prev_1_idx
#0       1               NaN         NaN
#1       1               NaN         NaN
#2       2               1.0         1.0
#3       3               2.0         1.0
#4       2               1.0         1.0
#5       1               NaN         NaN
#6       1               NaN         NaN
#7       2               6.0         6.0
#8       3               7.0         6.0
#9       4               8.0         6.0
#10      5               9.0         6.0
SpghttCd
  • 10,510
  • 2
  • 20
  • 25
1

This function should work:

def func(values, null_val=-1):
    # Initialize with arbitrary value
    prev_small = values * -2
    prev_1 = values * -2

    # Loop through values and find previous values
    for n, x in enumerate(values):
        prev_vals = values.iloc[:n]
        prev_small[n] = prev_vals[prev_vals < x].index[-1] if (prev_vals < x).any() else null_val
        prev_1[n] = prev_vals[prev_vals == 1].index[-1] if x != 1 and (prev_vals == 1).any() else null_val

    return prev_small, prev_1

df = pd.DataFrame({'value': [1,  1,  2,  3,  2,  1,  1,]})
df['previous_small'], df['previous_1'] = func(df['value'])

Output:

   value  previous_small  previous_1
0      1              -1          -1
1      1              -1          -1
2      2               1           1
3      3               2           1
4      2               1           1
5      1              -1          -1
6      1              -1          -1
busybear
  • 10,194
  • 1
  • 25
  • 42
1

Here is to do the previous_smaller_index

l=list(zip(df['index'],df.value))[::-1]

t=[]
n=len(l)
for x in l:
    if x[1]==1:
        t.append(-1)
    else:
        t.append(next(y for y in l[n-x[0]:] if y[1]<x[1])[0])
df['previous_smaller_index']=t[::-1]
df
Out[71]: 
   index  value  previous_smaller_index
0      0      1                      -1
1      1      1                      -1
2      2      2                       1
3      3      3                       2
4      4      2                       1
5      5      1                      -1
6      6      1                      -1

Get the previous 1

df['index'].where(df.value==1).ffill().where(df.value!=1,-1)
Out[77]: 
0   -1.0
1   -1.0
2    1.0
3    1.0
4    1.0
5   -1.0
6   -1.0
Name: index, dtype: float64

Assign it back

df['previous_1_index']=df['index'].where(df.value==1).ffill().where(df.value!=1,-1)



df
Out[79]: 
   index  value  previous_smaller_index  previous_1_index
0      0      1                      -1              -1.0
1      1      1                      -1              -1.0
2      2      2                       1               1.0
3      3      3                       2               1.0
4      4      2                       1               1.0
5      5      1                      -1              -1.0
6      6      1                      -1              -1.0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • What is `l` here? Can you please rewrite the first part so it is in context of `df`? – cs95 Jan 20 '19 at 05:35
  • Also, when I run your code for 'previous_smaller_index', I get `t = [-1, -1, 5.0, 4.0, 5.0, -1, -1]` can you check? – cs95 Jan 20 '19 at 05:45
  • 1
    @coldspeed got it , I forget to copy full code here, Edit ! :-) sorry about that man – BENY Jan 20 '19 at 06:06
  • @coldspeed seems like still not fast enough , your solutions' performance much better – BENY Jan 20 '19 at 06:28
  • It is a nice alternative. I think optimising previous_1_index any more will be very hard without Numba. And considering OP wants "one liners", I am not sure how much they would be interested in a numba solution. – cs95 Jan 20 '19 at 06:28
  • 1
    @coldspeed Numba should be even faster . one-line is neat , but if consider performance , I think multiple line dose mean 'bad' :-) (Just my 2 cents .) – BENY Jan 20 '19 at 06:31