
In a pandas time series, I am trying to count periods defined by a combination of a threshold and a duration.

For instance, I want the number of periods longer than 5 minutes during which the `pct` column is at or below 80.

The dataframe looks like this:

timestamp pct
27-05-2021 10:11 95
27-05-2021 10:12 94
27-05-2021 10:13 80
27-05-2021 10:14 94
27-05-2021 10:15 80
27-05-2021 10:16 80
27-05-2021 10:17 80
27-05-2021 10:18 80
27-05-2021 10:19 80
27-05-2021 10:20 91
27-05-2021 10:21 NaN
27-05-2021 10:22 80
27-05-2021 10:23 80
27-05-2021 10:24 80
27-05-2021 10:25 80
27-05-2021 10:26 94

It would thus need to identify 1 period (we do not want NaN values to count toward a period).

I've gotten some of the way with the post from Ben B, and the answer from Alain T here: How to count consecutive periods in a timeseries above/below threshold?

I've attached an ugly image from Microsoft Paint to illustrate the problem.

NB: It is quite a big dataframe, so I am not sure that iterating over the dataframe is the best idea, but any help is very much appreciated.

  • So, in the end you'd like to get the number `1` as the count, or one filtered dataframe, or a list of dataframe rows that satisfy the conditions? – Mustafa Aydın May 27 '21 at 08:56
  • In the end I just want a count of how many periods satisfy the condition. But if I am left with a filtered dataframe, I can also work my way from there... – Jesper Mølgaard May 27 '21 at 09:08
  • I think one easy solution, which I almost have working, is filtering by the simple condition to get booleans, then doing df.cumsum() to count the number of minutes; but I don't know how to get the count and reset it when it hits a `False`. – Jesper Mølgaard May 27 '21 at 09:16
  • Yes, similar thoughts here; I posted an answer, hope it helps. – Mustafa Aydın May 27 '21 at 09:20
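The "cumsum with reset" idea from the comments can be sketched like this (a rough illustration, assuming the 80 threshold from the question):

```python
import pandas as pd

# Toy version of the question's `pct` values
s = pd.Series([95, 94, 80, 94, 80, 80, 80, 80, 80, 91])

below = s.le(80)  # boolean mask: value at or below the threshold
# `(~below).cumsum()` increments at every False, so each run of Trues
# gets its own group label and the running count restarts per run
streak = below.astype(int).groupby((~below).astype(int).cumsum()).cumsum()
print(streak.tolist())  # [0, 0, 1, 0, 1, 2, 3, 4, 5, 0]
```

Each entry is the length (in rows, i.e. minutes here) of the current run of values at or below 80, resetting to 0 whenever the condition fails.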

1 Answer


You can group the consecutive runs of values at or below 80 in the dataframe, check the condition on each group in a list comprehension, and take the length of the result:

# `value_thre` is the threshold for the `pct` column,
# `minute_thre` is the duration threshold (in minutes) for `timestamp`
value_thre = 80
minute_thre = 3

# group by consecutive runs of values at or below `value_thre`
grouper = df.groupby(df.pct.le(value_thre).diff().ne(0).cumsum())

# look at the time difference between last and first timestamp
# also ensure no `pct` value exceeds the value threshold
condition = lambda gr: (gr.pct.max() <= value_thre
                        and gr.timestamp.iloc[-1] - gr.timestamp.iloc[0] > pd.Timedelta(f"{minute_thre} min"))

# filter the grouper and get the length
result = len([g for _, g in grouper if condition(g)])

to get

>>> result
1
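For reference, a self-contained sketch of the whole pipeline on the question's sample data, with `NaN` filled as `+inf` so it can never join a streak (the treatment suggested in the comments); the 3-minute threshold matches the snippet above:

```python
import numpy as np
import pandas as pd

# Question's sample data; the NaN at 10:21 splits the second streak
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-05-27 10:11", periods=16, freq="min"),
    "pct": [95, 94, 80, 94, 80, 80, 80, 80, 80, 91,
            np.nan, 80, 80, 80, 80, 94],
})
df["pct"] = df["pct"].fillna(np.inf)  # inf can never be <= 80

value_thre, minute_thre = 80, 3

# label consecutive runs of (pct <= 80) / (pct > 80)
grouper = df.groupby(df.pct.le(value_thre).diff().ne(0).cumsum())

# keep groups that stay at/below the threshold for long enough
condition = lambda gr: (gr.pct.max() <= value_thre
                        and gr.timestamp.iloc[-1] - gr.timestamp.iloc[0]
                            > pd.Timedelta(f"{minute_thre} min"))

result = len([g for _, g in grouper if condition(g)])
print(result)  # 1  (only the 10:15-10:19 streak is long enough)
```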
Mustafa Aydın
  • Awesome. I'm testing out now. Thanks so much! – Jesper Mølgaard May 27 '21 at 09:23
  • I seem to be running into a problem since my 'timeseries' column is also my index. I guess the simple way to solve this would be to make a new index, but since timeseries index gives me a lot of possibilities, would it be a problem to use the index column instead? I get this error when I try: AttributeError: 'DatetimeIndex' object has no attribute 'iloc' – Jesper Mølgaard May 27 '21 at 09:28
  • @JesperMølgaard In that case, I'd say you can replace `gr.timestamp.iloc` in the `condition` with `gr.index` to reach the index directly and compare there (from what I understood from your comment). – Mustafa Aydın May 27 '21 at 09:36
  • If that fails, you can try `gr.timestamp[-1]` and `0` respectively, i.e., without the `iloc` (from what I understood from the traceback). – Mustafa Aydın May 27 '21 at 09:37
  • @JesperMølgaard good to hear! Is the rest working? – Mustafa Aydın May 27 '21 at 10:01
  • I'm still testing, will report back ASAP :) – Jesper Mølgaard May 27 '21 at 10:14
  • Two problems so far: I am still working on getting it to ignore NaN in the correct way. Also, I can't adapt your solution to get all values below a certain threshold: I need periods with df[df['pct'] <= value], not only df['sat'].eq(value). Does that make sense? – Jesper Mølgaard May 27 '21 at 11:49
  • @JesperMølgaard For the second issue, changing `df.pct.eq` to `df.pct.le` should do, I think (from "`eq`ual" to "`l`ess than or `e`qual"). For the `NaN`, as far as I understand, `df = df.dropna()` before all these should work. Can you clarify how the `NaN`s should be treated exactly? – Mustafa Aydın May 27 '21 at 11:54
  • How to handle NaN is a good question. For now, I would like to ignore them, so that they don't count in any direction. If a sequence of < values appears with NaN interspersed, I need to reset the counter every time we hit a NaN, and a NaN at either end of a sequence should also not count. If need be, I can interpolate at a later time. One thing persists: it seems to count the time where values are greater than 'threshold'? If I change to .ge(value_thres) it seems to do better. But it is probably thrown off because the value column can look like: [...,80,-,93,-,92,-,92,-,91,-,90,-,91,-,80,...] – Jesper Mølgaard May 27 '21 at 12:44
  • @JesperMølgaard Oh, for the threshold issue: sorry for that. I modified the condition so that no group with a value bigger than the threshold is falsely accepted even though it satisfies the time condition. It may seem like we are redundantly comparing against `value_thre` two times, but the first comparison is what lets us group in the first place, and the other rejects groups that include large values, as you remarked. – Mustafa Aydın May 27 '21 at 13:04
  • @JesperMølgaard Along with this modification, maybe you can fill the `NaN`s with a very large number, e.g., inf, to assure they don't get counted at all, anywhere? That is, `df.pct = df.pct.fillna(np.inf)`. – Mustafa Aydın May 27 '21 at 13:07
  • Almost there. I am working through it now!! It's almost coming together :D – Jesper Mølgaard May 27 '21 at 13:18
  • It works! I just tested it on some of my data, and it seems to produce the correct results! Thanks a million!! – Jesper Mølgaard May 27 '21 at 13:28
  • Hi @MustafaAydın, this solution works with no change required. It is almost too good to be true. But I cannot understand how it works. Can you refer some background articles so that I can understand this code better? Thanks T. – Tanjil Oct 21 '21 at 10:35
  • Hi @Tanjil. I'll try to explain 3 pieces in the code: – Mustafa Aydın Oct 21 '21 at 12:55
  • `grouper = ...` line: The `.diff().ne(0).cumsum()` is a fairly common trick in pandas to get `itertools.groupby`-like behaviour. You'd like to group your values in a consecutive manner and start a new group whenever your grouping condition is interrupted. Let's say you'd like to group by "1 or not" in a series. The `s.groupby(s.eq(1))` approach groups *globally*, but `s.groupby(s.eq(1).diff().ne(0).cumsum())` groups *locally*. (You can try these on e.g., `s = pd.Series([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])`). – Mustafa Aydın Oct 21 '21 at 12:56
  • Reflecting this to the question: we'd like to get consecutive streams of values less than or equal to 80; therefore our condition is `s.le(80)` and diff-ne-cumsum follows. Links: https://pandas.pydata.org/docs/user_guide/groupby.html, https://stackoverflow.com/a/68611243/9332187. – Mustafa Aydın Oct 21 '21 at 12:56
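A quick sketch of the global-vs-local difference on the toy series from the comment above:

```python
import pandas as pd

s = pd.Series([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])

# global: two groups only, all 1s together and all non-1s together
global_sizes = s.groupby(s.eq(1)).size().tolist()
print(global_sizes)  # [5, 5]

# local: the label increments every time the condition flips,
# so each consecutive run becomes its own group
labels = s.eq(1).diff().ne(0).cumsum()
print(labels.tolist())  # [1, 2, 2, 2, 3, 3, 4, 4, 5, 5]
local_sizes = s.groupby(labels).size().tolist()
print(local_sizes)  # [1, 3, 2, 2, 2]
```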
  • `condition = ...` line: Now that we have the groups, we'd like to *filter* them, because we won't keep every group; note that as a side effect of grouping by "values are <= 80 *or not*", we possibly have groups whose values are all bigger than 80! There we traded off consecutive grouping against false positives. But the first line in that `lambda` handles it: we require that the maximum value in the group not exceed 80, which eliminates those false-positive groups. For the timestamp condition: since each group `gr` is a regular `DataFrame`, we accessed the `pct` column and we can also access the `timestamp`.. – Mustafa Aydın Oct 21 '21 at 13:01
  • ..column to check if the diff between last element (`iloc[-1]`) and first `iloc[0]` is above the threshold. Then, we `and` these two value & time thresholds to get the condition function. Links: `iloc`: https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-position, `Timedelta` where you can pass human readable strings! https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html, and the groupby link above again. – Mustafa Aydın Oct 21 '21 at 13:03
  • `result = ...` line: As [mentioned](https://pandas.pydata.org/docs/user_guide/groupby.html#iterating-through-groups) in the groupby documentation, a groupby object is iterable and, when iterated, yields `key, group` pairs. We ignore the `key` part via `_` in the `for` and subject the group `g` to our `condition`; the groups that satisfy it form a list, and the length of that list is the desired result. I hope these shed a bit of light on what's going on in the code. – Mustafa Aydın Oct 21 '21 at 13:08
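The iteration described in the last comment, in miniature (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"grp": [1, 1, 2], "val": [10, 20, 30]})

# iterating a groupby yields (key, sub-DataFrame) pairs
pairs = [(int(key), len(g)) for key, g in df.groupby("grp")]
print(pairs)  # [(1, 2), (2, 1)]
```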