1

I have a table like this:

import pandas as pd
df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6],
    "tmin": [-2, -3, -1, -4, -4, -2]
})

I want to create a column like this:

df['days_under_0_until_now'] = [1, 2, 3, 4, 5, 6]
df['days_under_-2_until_now'] = [1, 2, 0, 1, 2, 3]
df['days_under_-3_until_now'] = [0, 1, 0, 1, 2, 0]

So days_under_X_until_now means how many consecutive days until now tmin was under or equals X

I'd like to avoid do this with loops since the data is huge. Is there an alternative way to do it?

santos82h
  • 452
  • 5
  • 15

1 Answers1

1

For improve performance avoid using groupby compare values of column to list and then use this solution for count consecutive Trues:

vals = [0,-2,-3]

arr = df['tmin'].to_numpy()[:, None] <= np.array(vals)[ None, :]
cols = [f'days_under_{v}_until_now' for v in vals]
df1 = pd.DataFrame(arr, columns=cols, index=df.index)

b = df1.cumsum()
df = df.join(b.sub(b.mask(df1).ffill().fillna(0)).astype(int))
print (df)
   day  tmin  days_under_0_until_now  days_under_-2_until_now  \
0    1    -2                       1                        1   
1    2    -3                       2                        2   
2    3    -1                       3                        0   
3    4    -4                       4                        1   
4    5    -4                       5                        2   
5    6    -2                       6                        3   

   days_under_-3_until_now  
0                        0  
1                        1  
2                        0  
3                        1  
4                        2  
5                        0  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252