Pandas: get position of last value based on condition for each column (efficiently)

Question

For each column of my dataframe, I want to find the row in which the value 1 last occurs. Given this row index, I want to calculate the "recency" of the occurrence, like so:

>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0

Desired result:

>> calculate_recency_vector(df)
[3,1,1,None]

The desired result shows, for each column, "how many rows ago" the value 1 last appeared. E.g. for column a, the value 1 last appears in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?

Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.

Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.

and why the recency for the 3rd column should be `1` (in your output) ? — RomanPerekhrest, Dec 25 '22 at 11:32
Because it is the (first)-last (like second-last, third-last, etc) row. Or in other words I start counting at `1` and not `0`. Of course you could start counting from zero like in array slicing. That is just a choice of preference. — Viktor, Dec 25 '22 at 11:45

Viktor · Accepted Answer · 2022-12-28T14:44:53.223

A loop-less solution which is faster & cleaner:

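>> import pandas as pd

>> def calculate_recency_for_one_column(column: pd.Series) -> int:
>>     # Assumes the default RangeIndex (0..n-1), so index labels equal row positions.
>>     non_zero_values_of_col = column[column.astype(bool)]
>>     if non_zero_values_of_col.empty:
>>         return 0  # no non-zero value in this column
>>     return len(column) - non_zero_values_of_col.index[-1]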

>> df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})

>> df.apply(calculate_recency_for_one_column, axis=0)

a    3
b    1
c    1
d    0
dtype: int64

Sidenote: Using DataFrame.apply() is slow (SO explanation). There are faster alternatives, such as np.where or apply(..., raw=True). See this question for details.
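To illustrate the kind of faster alternative the sidenote hints at, here is a minimal NumPy sketch that avoids apply() entirely. It is not the method from this answer: the function name `recency_vectorized` is hypothetical, it matches on equality with a given value rather than on truthiness, it assumes the dataframe has at least one row, and it returns NaN (not 0 or None) for columns without a match.

```python
import numpy as np
import pandas as pd


def recency_vectorized(df: pd.DataFrame, value=1) -> pd.Series:
    """Hypothetical helper: 1-based recency of the last `value` per column."""
    # Boolean matrix of shape (n_rows, n_cols): True where the cell matches.
    mask = df.to_numpy() == value
    # Flip the rows so that argmax finds the first match counted from the bottom.
    from_bottom = mask[::-1].argmax(axis=0)
    # Counting starts at 1, as in the question (last row -> recency 1).
    recency = (from_bottom + 1).astype(float)
    # argmax returns 0 for all-False columns, so mark those explicitly as NaN.
    recency[~mask.any(axis=0)] = np.nan
    return pd.Series(recency, index=df.columns)


df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})
result = recency_vectorized(df)
# result holds a=3.0, b=1.0, c=1.0, d=NaN
```

Because this works on row positions of the underlying array, it does not depend on the dataframe's index labels. Whether it is actually faster for (42,250) frames should be checked with a timing run on your own data.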

score 0 · Answer 2 · answered Dec 25 '22 at 12:10

With this example dataframe, you can define a function as follows:

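import pandas as pd

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []

    for col in df.columns:
        values = df[col].to_list()  # convert once instead of on every use
        last = None  # position of the last match; None means "no match"
        for i, y in enumerate(values):
            if y == condition:
                last = i

        # Rows counted from the bottom, starting at 1; None if never matched.
        # (Using last = 0 as the sentinel would wrongly report None when the
        # only match is in the first row.)
        recency = None if last is None else len(values) - last

        recency_vector.append(recency)

    return recency_vector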

Running the function, it will return this:

calculate_recency_vector(df, 1)
[3, 1, 1, None]

score 0 · Answer 3 · answered Dec 25 '22 at 17:07

This

df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x : ([df.shape[0] - i for i ,v in x.items() if v==1] or [None])[-1], axis=0)

produces the desired output as a pd.Series, with the only difference that the result is float and None is replaced by pandas NaN. You could then take the desired column.
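If a plain list with None is needed, as in the question, the NaN entries of such a float Series can be mapped back. A minimal sketch, where the helper name `to_recency_list` is hypothetical and not part of pandas:

```python
import pandas as pd


def to_recency_list(result: pd.Series) -> list:
    """Hypothetical helper: float Series with NaN -> list of int or None."""
    # pd.isna catches the NaN that pandas substituted for None.
    return [None if pd.isna(v) else int(v) for v in result]


recency = to_recency_list(pd.Series([3.0, 1.0, 1.0, float("nan")],
                                    index=["a", "b", "c", "d"]))
# recency == [3, 1, 1, None]
```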
