
I posted about this problem two days ago and thankfully got an answer.

I have a dataset of 20 rows and 2500 columns. Each column is a unique product and the rows are a time series of measurement results, so each product is measured 20 times and there are 2500 products.

This time I want to know for how many consecutive rows my measurement result can stay above a specific threshold. In other words: I want to count the number of consecutive values that are above a given value, let's say 5.

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]. Here the runs above 5 are `6, 8, 7` and `6, 10`, so according to what I defined above I should get NumofConsFeature = 3 as the result (taking the max if there is more than one run that meets the condition).

I thought of filtering with `.gt`, then getting the indexes and using a loop to detect the consecutive index numbers, but couldn't make it work.

In a second phase, I'd like to know the index of the first value of my consecutive series. For the above example, that would be 3. But I have no idea how to do this one.

Thanks in advance.

meliksahturker
  • Consider accepting the answer on your previous post - you can do this by clicking the checkmark next to the answer. – andrew_reece Oct 05 '18 at 18:28
  • Should `6, 8, 7` be `6, 7, 8`? – andrew_reece Oct 05 '18 at 19:01
  • Do you mean the index of the first value should be 2 and not 3? For `A=[1,2,6,7,8...]` the index of 6 is 2. – andrew_reece Oct 05 '18 at 19:22
  • starting from 0, you are right yes it should be 2. and no, `6, 8, 7` has no reason to be ordered from smallest to largest – meliksahturker Oct 07 '18 at 19:52
  • Ok - but `6,8,7` is not a consecutive series. How do you determine the window within which to sort sub-sequences? – andrew_reece Oct 07 '18 at 19:55
  • consecutive as passing the test of being over a threshold. we are checking whether it's over 5, so it goes like: `[FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE]` – meliksahturker Oct 07 '18 at 20:06
  • I see. Thanks for clarifying. – andrew_reece Oct 07 '18 at 20:25
  • Did either of the posted solutions work for you? Consider accepting the one that worked best for you? – Divakar Oct 28 '18 at 04:47
  • Yes, though I proceeded using the solution here https://stackoverflow.com/questions/52717996/how-can-i-count-the-number-of-consecutive-trues-in-a-dataframe since using boolean seemed more practical for other features too. Thanks a lot for your help. – meliksahturker Oct 29 '18 at 11:50

5 Answers


Here's another answer using only Pandas functions:

import pandas as pd

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=['foo'])
a['is_large'] = a.foo > 5                                    # threshold test
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()  # label each run
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[~a.is_large, 'count'] = 0                              # zero out runs below the threshold

which gives

    foo  is_large  crossing  count
0     1     False         1      0
1     2     False         1      0
2     6      True         2      3
3     8      True         2      2
4     7      True         2      1
5     3     False         3      0
6     2     False         3      0
7     3     False         3      0
8     6      True         4      2
9    10      True         4      1
10    2     False         5      0
11    1     False         5      0
12    0     False         5      0
13    2     False         5      0

From there on you can easily find the maximum and its index.
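For instance, a minimal sketch building on the frame above: the run length and the index of its first value (the "2nd phase" of the question) fall out of `max()` and `idxmax()` on the `count` column, since the first value of each run carries the full run length:

```python
import pandas as pd

A = [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2]
a = pd.DataFrame(A, columns=['foo'])
a['is_large'] = a.foo > 5
a['crossing'] = (a.is_large != a.is_large.shift()).cumsum()
a['count'] = a.groupby(['is_large', 'crossing']).cumcount(ascending=False) + 1
a.loc[~a.is_large, 'count'] = 0

max_len = int(a['count'].max())       # longest run above the threshold -> 3
start_idx = int(a['count'].idxmax())  # index of its first value -> 2
```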

Bart

There is a simple way to do that.
Let's say your list is: A = [1, 2, 6, 8, 7, 6, 8, 3, 2, 3, 6, 10, 6, 7, 8, 2, 1, 0, 2]
And you want to find how many runs there are with values bigger than 6 and a length of 5. For instance, here the answer is 2: there are two runs whose values are bigger than 6 for 5 positions in a row. In Python and pandas we do that like below:

condition = (df.wanted_row > 6) & \
            (df.wanted_row.shift(-1) > 6) & \
            (df.wanted_row.shift(-2) > 6) & \
            (df.wanted_row.shift(-3) > 6) & \
            (df.wanted_row.shift(-4) > 6)

consecutive_count = df[condition].count().head(1)[0]
Mehmet nuri
  • Here, I wrote it in nicer syntax: `condition = eval(' & '.join([f'(pos.shift({x})>6)' for x in range(num_consecutive)]))` – Landmaster Aug 04 '21 at 14:26
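The chained shifts can also be expressed with a rolling window; a sketch assuming pandas. (Note: the sample list above contains two qualifying windows of length 5 only if the comparison is `>= 6`; with a strict `> 6` no window of 5 qualifies.)

```python
import pandas as pd

df = pd.DataFrame({"wanted_row": [1, 2, 6, 8, 7, 6, 8, 3, 2, 3,
                                  6, 10, 6, 7, 8, 2, 1, 0, 2]})
window = 5
# the rolling min of the boolean mask is 1 exactly where all
# `window` consecutive values pass the test
hits = (df.wanted_row >= 6).rolling(window).min().eq(1)
consecutive_count = int(hits.sum())  # number of qualifying windows -> 2
```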

You can apply diff() on your Series, then count the number of consecutive entries where the difference is 1 and the actual value is above your cutoff. The largest such count is the maximum number of consecutive values.

First compute diff():

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
df['b'] = df.a.diff()

df
     a    b
0    1  NaN
1    2  1.0
2    6  4.0
3    7  1.0
4    8  1.0
5    3 -5.0
6    2 -1.0
7    3  1.0
8    6  3.0
9   10  4.0
10   2 -8.0
11   1 -1.0
12   0 -1.0
13   2  2.0

Now count consecutive sequences:

above = 5
n_consec = 1
max_n_consec = 1

for a, b in df.values[1:]:
    if (a > above) and (b == 1):
        n_consec += 1
    else:  # check for new max, then start again from 1
        max_n_consec = max(n_consec, max_n_consec)
        n_consec = 1

max_n_consec = max(n_consec, max_n_consec)  # in case the longest run ends at the last row
max_n_consec
3
andrew_reece
  • Thank you for your response, however I have many many columns and they will only increase in time since they are unique products. Therefore I cannot write codes for every series/product. So I need a more general solution, for unlimited number of columns. – meliksahturker Oct 07 '18 at 19:54
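To address the per-column requirement from this comment: the single-series logic can be applied column-wise with `DataFrame.apply`. A sketch assuming pandas, using the boolean-run labelling trick from the other answers; the column names are hypothetical:

```python
import pandas as pd

def max_run_above(s, threshold=5):
    """Longest run of consecutive values above `threshold` in a Series."""
    above = s > threshold
    run_id = (above != above.shift()).cumsum()      # label each run of equal values
    return int(above.groupby(run_id).sum().max())   # longest True run (0 if none)

# two hypothetical product columns
df = pd.DataFrame({
    "p1": [1, 2, 6, 8, 7, 3, 2, 3, 6, 10, 2, 1, 0, 2],
    "p2": [6, 6, 1, 1, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1],
})
result = df.apply(max_run_above)  # one run length per product column
```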

Here's one with maxisland_start_len_mask -

# https://stackoverflow.com/a/52718782/ @Divakar
def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a boolean array

    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))

    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2

    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1

    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]

    max_island_percol_start = max_island_starts%(a.shape[0]+1)

    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len

def maxisland_start_len(a, trigger_val, comp_func=np.greater):
    # a is 2D array as the data
    mask = comp_func(a,trigger_val)
    return maxisland_start_len_mask(mask, fillna_index = -1, fillna_len = 0)

Sample run -

In [169]: a
Out[169]: 
array([[ 1,  0,  3],
       [ 2,  7,  3],
       [ 6,  8,  4],
       [ 8,  6,  8],
       [ 7,  1,  6],
       [ 3,  7,  8],
       [ 2,  5,  8],
       [ 3,  3,  0],
       [ 6,  5,  0],
       [10,  3,  8],
       [ 2,  3,  3],
       [ 1,  7,  0],
       [ 0,  0,  4],
       [ 2,  3,  2]])

# Per column results
In [170]: row_index, length = maxisland_start_len(a, 5)

In [172]: row_index
Out[172]: array([2, 1, 3])

In [173]: length
Out[173]: array([3, 3, 4])
Divakar
  • Thank you for your marvelous reply. Although I've had difficulties understanding it, it worked great on an example array a. However, when I've tried with the data, it gave me an error which reads `File "", line 74, in maxisland_start_len IndexError: index 126 out-of-bounds in maximum.reduceat [0, 126)` and the line 74 is `max_island_percol_len = np.maximum.reduceat(island_lens, grp_shift_idx[:-1])` I'm new to this, so the syntax is bit alien for me. I'm trying to understand, but without the comments it wasn't very easy. Can you elaborate in more detail? Thank you in advance. – meliksahturker Oct 07 '18 at 14:25
  • @crinix Is the input a NumPy array or something else? – Divakar Oct 07 '18 at 14:33
  • It was Pandas DataFrame however was converted into Numpy array with `na = df.values` – meliksahturker Oct 07 '18 at 14:57
  • @crinix What's the shape and datatype of `na`? – Divakar Oct 07 '18 at 15:00
  • Shape: (34, 288) Dtype: float64 – meliksahturker Oct 07 '18 at 15:04
  • @crinix I am not sure really. `maxisland_start_len(na, 5)` should have worked. – Divakar Oct 07 '18 at 15:07
  • And when I tried it with random data on Anaconda Prompt, it gives the error: `Traceback (most recent call last): File "asd.py", line 30, in print(maxisland_start_len(a, 5)) File "asd.py", line 19, in maxisland_start_len scale = island_lens.max()+1 File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\_methods.py", line 26, in _amax return umr_maximum(a, axis, None, out, keepdims) ValueError: zero-size array to reduction operation maximum which has no identity` – meliksahturker Oct 07 '18 at 15:50
  • @crinix Make sure that you are using the array data `na` and not something unknown `a` with : `maxisland_start_len(na, 5)`. – Divakar Oct 07 '18 at 15:53
  • OK I made sure and got a finding. Whenever I set the threshold below 5, it works and gives me the result, however whenever I set threshold > 4, it results in error. For example: if its 5, the error is `IndexError: index 283 out-of-bounds in maximum.reduceat [0, 283)` and if it is set to 10, then the error is: `IndexError: index 126 out-of-bounds in maximum.reduceat [0, 126)` – meliksahturker Oct 07 '18 at 16:34
  • I even tried to divide the whole table by 10, and set the threshold to 1. The resulting error was the same as when the threshold was 10, before I divided the data by 10: `IndexError: index 126 out-of-bounds in maximum.reduceat [0, 126)` – meliksahturker Oct 07 '18 at 16:39
  • `@Divakar` I removed one third of the columns(data) and it works without giving error however it calculates wrong. I highly appreciated your help however I am having hard time understanding and working on your method. So can you at least help me with comments of each line so that I can diagnose and then improve and make it work perhaps? Thanks in advance. – meliksahturker Oct 08 '18 at 14:17
  • @crinix I am guessing the issue was with all 0s columns. Fixed in your newest asked question. – Divakar Oct 09 '18 at 10:27
  • @crinix Updated the solution here with help from solution to the newest question. – Divakar Oct 09 '18 at 11:21

Here's how I did it using numpy:

import pandas as pd
import numpy as np


df = pd.DataFrame({"a":[1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})


consecutive_steps = 2
marginal_price = 5

assertions = [(df.loc[:, "a"].shift(-i) < marginal_price) for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)

consecutive_count = df.loc[condition, :].count()
print(consecutive_count)

which yields 6.
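Note that as written this counts windows whose values are all *below* `marginal_price`; flipping the comparison counts windows above the threshold, as in the original question. A sketch under the same setup:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 6, 7, 8, 3, 2, 3, 6, 10, 2, 1, 0, 2]})
consecutive_steps = 2
marginal_price = 5

# True at positions where `consecutive_steps` values in a row exceed the threshold
assertions = [(df["a"].shift(-i) > marginal_price) for i in range(consecutive_steps)]
condition = np.all(assertions, axis=0)
count_above = int(condition.sum())  # -> 3 for this data
```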

Cord Kaldemeyer