1

I'm trying to count the frequency of the number of occurrences a value has been higher than today's value for all dates in the past.

I've looked into using pandas rolling and cumcount functions, but still cannot figure out how to do this efficiently.

Below is a working code snippet of what I'm trying to achieve with the output displayed in the image following. My gut says that this is an extremely ugly and inefficient approach, so I was hoping to find something better.

    def freq_greater_than(r):
        smaller_date = df[df.date < r.date]
        larger_num = smaller_date[(smaller_date.num > r.num)]
        return round(len(larger_num) / len(smaller_date) * 100, 2)
    
    index = pd.date_range('2020-01-01', '2020-01-10')
    df = pd.DataFrame(dict(num=np.random.random_integers(0, 100, len(index))), index=pd.Series(index, name='date'))
    df['date'] = df.index
    df['freq_greater_than'] = df.iloc[1:].apply(freq_greater_than, axis=1)
    df.style.format({'freq_greater_than': '{:.2f}%'})

enter image description here

Olshansky
  • 5,904
  • 8
  • 32
  • 47
  • Please don't post images of data, it doesn't help, you need to post data in a format that we can copy/paste. See this [page](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – joao Jun 20 '21 at 15:35
  • 1
    @joao, data is already there in the copy/paste format, that image represents the expected outcome. – ThePyGuy Jun 20 '21 at 15:36
  • oh, my bad, I didn't see you were generating random data – joao Jun 20 '21 at 15:40
  • @joao, it's not me, it's OP – ThePyGuy Jun 20 '21 at 15:45

3 Answers3

1

You can use .expanding():

# convert date and sort dataframe:
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(by="date")

df["freq_greater_than"] = (
    df["num"]
    .expanding(1)
    .apply(lambda x: ((x.iat[-1] < x[:-1]).sum() / (len(x) - 1)) * 100)
)
print(df)

Prints:

        date  num  freq_greater_than
0 2020-01-01   72                NaN
1 2020-01-02   28              100.0
2 2020-01-03   47               50.0
3 2020-01-04    2              100.0
4 2020-01-05   73                0.0
5 2020-01-06    8               80.0

...
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
1

Here is one way with expanding:

df["freq_gt"] = (df.num
                   .expanding()
                   .agg(lambda x: 100 * x.iloc[:-1].gt(x.iat[-1]).sum() / (x.size - 1)))

where expanding sends ever-increasing windows over num column and x represents the windows. Then we look at the values except for the last one via iloc[:-1] and compare it (gt) with the last one (iat[-1]). Then sum counts the number of Trues resulted from comparison which gives the count of those that are greater. Dividing it by the window size minus 1 to exclude the current value and 100 * gives the ratio,

to get

            num     freq_gt
date
2020-01-01   46         NaN
2020-01-02   35  100.000000
2020-01-03   80    0.000000
2020-01-04    4  100.000000
2020-01-05   43   50.000000
2020-01-06    4   80.000000
2020-01-07    7   66.666667
2020-01-08   91    0.000000
2020-01-09   95    0.000000
2020-01-10   99    0.000000
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
1

Try with expanding apply + mean over gt comparison:

df['freq_greater_than'] = (
    df['num'].expanding()
        .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).mean() * 100)
)

df:

            num       date  freq_greater_than
date                                         
2020-01-01   72 2020-01-01                NaN
2020-01-02   28 2020-01-02         100.000000
2020-01-03   47 2020-01-03          50.000000
2020-01-04    2 2020-01-04         100.000000
2020-01-05   73 2020-01-05           0.000000
2020-01-06    8 2020-01-06          80.000000
2020-01-07   87 2020-01-07           0.000000
2020-01-08   94 2020-01-08           0.000000
2020-01-09   40 2020-01-09          62.500000
2020-01-10   25 2020-01-10          77.777778

Optional map for string formatting:

df['freq_greater_than'] = (
    df['num'].expanding()
        .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).mean() * 100)
        .map('{:.2f}%'.format)
)
            num       date freq_greater_than
date                                        
2020-01-01   72 2020-01-01              nan%
2020-01-02   28 2020-01-02           100.00%
2020-01-03   47 2020-01-03            50.00%
2020-01-04    2 2020-01-04           100.00%
2020-01-05   73 2020-01-05             0.00%
2020-01-06    8 2020-01-06            80.00%
2020-01-07   87 2020-01-07             0.00%
2020-01-08   94 2020-01-08             0.00%
2020-01-09   40 2020-01-09            62.50%
2020-01-10   25 2020-01-10            77.78%

DataFrame Constructor and Imports:

import pandas as pd

index = pd.date_range('2020-01-01', '2020-01-10')
df = pd.DataFrame({'num': [72, 28, 47, 2, 73, 8, 87, 94, 40, 25]},
                  index=pd.Series(index, name='date'))
df['date'] = df.index
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57