
I am trying to find the number of consistently falling (consecutively decreasing) values in a column of my Pandas DataFrame (df). The column name is "values" and a data snippet is given below:

# sample data
time    values
11:55   0.940353
12:00   0.919144
12:05   0.909454
12:10   0.904968
12:15   0.867957
12:20   0.801426
12:25   0.794733
12:30   0.770106
12:35   0.741985
12:40   0.671444
12:45   0.558297
12:50   0.496972
12:55   0.457803
13:00   0.446388
13:05   0.430217
13:10   0.379902
13:15   0.321828
13:20   0.298304
13:25   0.442079
13:30   0.634764

Note: in the above sample data, the values from 11:55 to 13:20 are of interest to me. I need to report the number of falling values (18 in this case), and then the lower (ending) value as a percentage of the starting/higher value (here 0.298304 / 0.940353 ≈ 31.7%).

I have tried to iterate over my df using iterrows ("for index, row in df.iloc[20:].iterrows():", my starting row being 20), and I then tried to use a temp variable for comparison, but this is not giving me the desired results.
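
For reference, this is roughly the kind of loop I mean (a sketch rather than my exact code):

prev = None
count = 0
for index, row in df.iterrows():
    if prev is not None and row["values"] < prev:
        count += 1  # this value fell compared to the previous row
    else:
        count = 0   # the falling streak is broken, so reset
    prev = row["values"]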

TIA

Suraj_j
    Why aren't 11:55 and 12:00 of interest? They are part of the decreasing series. And what do you mean by "%age of lower value compared to %age of starting Value"? Please provide your expected output. – andrew_reece Jul 21 '18 at 03:39
  • Yes Andrew, you are right, 11:55 to 12:00 are of interest. I stand corrected – Suraj_j Jul 21 '18 at 03:43
  • In the above sample data the lower value is just 31.7% of the higher value. I can calculate this only if I have the start as well as the end value – Suraj_j Jul 21 '18 at 03:53
  • If you have separate, non-consecutive sections of decreasing values, it would be helpful to update your example data to demonstrate this condition. – andrew_reece Jul 21 '18 at 04:56
  • do you want aggregate information on the group level, or per row information? – Quickbeam2k1 Jul 21 '18 at 06:18
  • @suraj_Blore did any of the answers solve the problem? If so, do you mind selecting the best answer? – JAponte Jul 26 '18 at 01:44

4 Answers


Here's a solution that accounts for a couple of edge cases:
1. Non-consecutive groups of descending series
2. Last value being part of a descending series

I've updated the OP's data to create 3 groupings of descending series (see the Data section below) to demonstrate edge-case coverage.

The basic strategy is to create named groups for each descending series, and then use groupby() operations to compute the desired attributes.

First create a dec column that tracks descending series:

df["dec"] = df["values"].diff().lt(0)

Now make group assignments. The assign_group() function basically checks the dec value of its neighbors and determines its group based on neighbor status.

groups = ["a", "b", "c", "d", "e"]
groups2 = ["a", "b", "c", "d", "e"]
orig_len = len(groups)

def assign_group(x):
    # index of the group label currently in use
    ix = orig_len - len(groups2)
    try:
        if ~x.dec & df.iloc[x.name + 1].dec:
            # this row starts a new descending series: consume the next label
            return groups2.pop(0)
        elif df.iloc[x.name + 1].dec | x.dec & ~df.iloc[x.name + 1].dec:
            # this row continues a series, or is its last element: reuse the current label
            return groups[ix-1]
        else:
            return None
    except IndexError:
        # last row of the frame: it belongs to a series only if it is itself descending
        if x.dec:
            return groups[ix-1]
        return None

df["grp"] = df.apply(assign_group, axis=1)

^ Note: The groups variable can contain any sequence of values - I've just used a few letters for demonstration, but for a much larger dataset some large sequence of numbers might be more appropriate.
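
For a larger frame, here's a sketch of one way to generate enough labels (integer labels, my own illustration rather than part of the answer above):

groups = list(range(len(df)))   # one integer label per row is more than enough
groups2 = list(range(len(df)))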

Now the final groupby() is simple:

df.groupby("grp")["values"].agg(["count", lambda x: x.iloc[-1] / x.iloc[0] * 100])

     count   <lambda>
grp                  
a       10  71.403399
b        8  29.830400
c        2   0.000000

df
     time    values    dec   grp
0   11:55  0.940353  False     a
1   12:00  0.919144   True     a
2   12:05  0.909454   True     a
3   12:10  0.904968   True     a
4   12:15  0.867957   True     a
5   12:20  0.801426   True     a
6   12:25  0.794733   True     a
7   12:30  0.770106   True     a
8   12:35  0.741985   True     a
9   12:40  0.671444   True     a
10  12:45  1.000000  False     b
11  12:50  0.496972   True     b
12  12:55  0.457803   True     b
13  13:00  0.446388   True     b
14  13:05  0.430217   True     b
15  13:10  0.379902   True     b
16  13:15  0.321828   True     b
17  13:20  0.298304   True     b
18  13:25  0.442079  False  None
19  13:30  0.634764  False     c
20  13:35  0.000000   True     c

Correctness checks:

Group a first value: 0.940353, last value: 0.671444; 0.671444/0.940353 = 0.7140339851

Group b first value: 1.0, last value: 0.298304, 0.298304/1.000000 = 0.298304

Output from original OP data:

     count   <lambda>
grp                  
a       18  31.722555

Data:

df.loc[10, "values"] = 1
df.loc[20, ["time","values"]] = ["13:35", 0]
df
     time    values
0   11:55  0.940353
1   12:00  0.919144
2   12:05  0.909454
3   12:10  0.904968
4   12:15  0.867957
5   12:20  0.801426
6   12:25  0.794733
7   12:30  0.770106
8   12:35  0.741985
9   12:40  0.671444
10  12:45  1.000000 # <-- updated
11  12:50  0.496972
12  12:55  0.457803
13  13:00  0.446388
14  13:05  0.430217
15  13:10  0.379902
16  13:15  0.321828
17  13:20  0.298304
18  13:25  0.442079
19  13:30  0.634764
20  13:35  0.000000  # <-- updated

Note: Using values as a DataFrame column name is fraught, as .values is also a Pandas property, so df.values won't refer to the column; you need df["values"] in that case.
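
A quick sketch of the pitfall:

df.values      # the underlying NumPy array of the whole frame, not the column
df["values"]   # the column, as a Series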

andrew_reece
  • While this answer is helpful, as my data contains over 8K rows I need to calculate and report all the occurrences, i.e. every stretch of data with a consistent fall of greater than or equal to 7 steps. Can we achieve this purpose with diff().lt(0).sum()? – Suraj_j Jul 21 '18 at 04:19
  • Just wondering whether itertools can be useful here (I have never had the opportunity to use it) – Suraj_j Jul 21 '18 at 04:22
  • The OP wants the number of consecutively decreasing rows. – JAponte Jul 21 '18 at 04:23
  • Yes, consecutively decreasing values, their count, and their degree are required here. You are correct, JAponte. – Suraj_j Jul 21 '18 at 04:32
  • Yes, understood - this answer doesn't address OP's needs. Update coming! – andrew_reece Jul 21 '18 at 04:48
  • suraj_Blore see updated solution. But can you clarify whether you only need the max number of records, as in @JAponte's solution, or whether you need reports on each sequence of decreasing values, as in my solution? – andrew_reece Jul 21 '18 at 05:54

Here is a reproducible example with a function that accomplishes your goal. I'm assuming you might have multiple sequences of decreasing records, so the function returns, for each decreasing sequence, the count of decreasing records and the final (lowest) value as a percentage of the starting (highest) value.

import pandas as pd


def get_decreasing_count_and_percentage(df, min_count=7):
    # True wherever a value is lower than the one before it
    is_dec = df["value"].diff().lt(0).values
    cnt = 0
    starting_value = df["value"].values[0]
    result = []
    for i in range(len(is_dec)):
        if is_dec[i]:
            cnt += 1
        else:
            if cnt > 0:
                # a decreasing run just ended: record its length and the
                # last value as a percentage of the starting value
                percent = round((df["value"].values[i-1] / starting_value) * 100., 1)
                result.append((cnt+1, percent))
                cnt = 0
                starting_value = df["value"].values[i]
    result_df = pd.DataFrame.from_records(result, columns=['count', 'percentage'])
    # report only runs of at least min_count records
    return result_df[result_df["count"] >= min_count]

Original Data

times = ['11:55', '12:00', '12:05', '12:10', '12:15', '12:20', '12:25', '12:30', '12:35', '12:40', '12:45', '12:50', '12:55', '13:00', '13:05', '13:10', '13:15', '13:20', '13:25', '13:30']
values = [0.940353, 0.919144, 0.909454, 0.904968, 0.867957, 0.801426, 0.794733, 0.770106, 0.741985, 0.671444, 0.558297, 0.496972, 0.457803, 0.446388, 0.430217, 0.379902, 0.321828, 0.298304, 0.442079, 0.634764, ]

my_df = pd.DataFrame(data={'time': times, 'value': values})
print(get_decreasing_count_and_percentage(my_df))

Result

   count  percentage
0     18        31.7

Modified Data:

values2 = [0.940353, 0.919144, 0.909454, 0.904968, 0.867957, 0.801426, 0.894733, 0.770106, 0.741985, 0.671444, 0.558297, 0.496972, 0.457803, 0.446388, 0.430217, 0.379902, 0.321828, 0.298304, 0.442079, 0.634764]
my_df2 = pd.DataFrame(data={'time': times, 'value': values2})
print(get_decreasing_count_and_percentage(my_df2))
print(get_decreasing_count_and_percentage(my_df2, min_count=6))

Result

   count  percentage
1     12        33.3
   count  percentage
0      6        85.2
1     12        33.3

UPDATE: Updated the code to address the percent value requirement.

UPDATE 2: The function now returns a data frame with a summary of all decreasing sequences. Also added a modified data set to show multiple decreasing sequences.

UPDATE 3: Added the min_count=7 default parameter to satisfy the OP's requirement in a comment (i.e. report only on sequences of length >= 7).

JAponte

I added two more rows to your DataFrame:

frame = pd.concat([df.rename(columns={'values': 'value'}), 
                   pd.DataFrame({'time':['13:35', '13:40'], 'value':[0.5, 0.9]})],
                  ignore_index=True)

Using cumsum you can create groups of consecutive elements:

groups = frame.assign(group = frame.value.diff().fillna(1).gt(0).cumsum())
grouper = groups.groupby('group')

This is "stolen" from here

Now just calculate the values you are interested in, e.g.:

counts = grouper['value'].transform(lambda x: x.agg('count'))
pct = grouper['value'].apply(lambda x: x.pct_change())

Assign these, and filter out groups with a count of only 1:

groups.assign(counts=counts, pct=pct).query('counts > 1')
    time    value       group   counts  pct
0   11:55   0.940353    1       18.0    NaN
1   12:00   0.919144    1       18.0    -0.022554
2   12:05   0.909454    1       18.0    -0.010542
3   12:10   0.904968    1       18.0    -0.004933
4   12:15   0.867957    1       18.0    -0.040898
5   12:20   0.801426    1       18.0    -0.076652
6   12:25   0.794733    1       18.0    -0.008351
7   12:30   0.770106    1       18.0    -0.030988
8   12:35   0.741985    1       18.0    -0.036516
9   12:40   0.671444    1       18.0    -0.095071
10  12:45   0.558297    1       18.0    -0.168513
11  12:50   0.496972    1       18.0    -0.109843
12  12:55   0.457803    1       18.0    -0.078815
13  13:00   0.446388    1       18.0    -0.024934
14  13:05   0.430217    1       18.0    -0.036226
15  13:10   0.379902    1       18.0    -0.116953
16  13:15   0.321828    1       18.0    -0.152866
17  13:20   0.298304    1       18.0    -0.073095
19  13:30   0.634764    3       2.0     NaN
20  13:35   0.500000    3       2.0     -0.212306
Quickbeam2k1
  • Nice way of assigning groups! I think though that the percent change OP was looking for was between the first and last of each descending series - not from row to row. – andrew_reece Jul 21 '18 at 09:41
  • Thanks! This can be done as well in multiple steps: just use `first()` and `last()` on the desired column and compute the percent change by hand. Using those functions in transform, you could easily compute the pct change for every row compared to the first. I'll make it more specific if the OP gives more detailed information – Quickbeam2k1 Jul 21 '18 at 14:38
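
For reference, a minimal sketch of the first()/last() approach that comment describes (my own illustration, reusing the grouper defined in the answer above):

# per-group summary: run length, plus the last value as a percentage of the first
summary = grouper['value'].agg(['count', 'first', 'last'])
summary['pct'] = summary['last'] / summary['first'] * 100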

I learned a lot from the answer by @JAponte, but it has a small bug. If values becomes

values = [0.940353, 0.999144, 0.909454, 0.904968, 0.867957, 0.801426, 0.794733, 0.770106, 0.741985, 0.671444, 0.558297, 0.496972, 0.457803, 0.446388, 0.430217, 0.379902, 0.321828, 0.298304, 0.442079, 0.634764,]

the output will be:

   count  percentage
0     17        31.7

The percentage here is obtained as 0.298304 / 0.940353, but the correct answer should be 0.298304 / 0.999144 ≈ 0.299, because this decreasing run actually starts at 0.999144, not at 0.940353. So I made the following changes:

import pandas as pd


def get_decreasing_count_and_percentage(df, min_count=7):
    is_dec = df["value"].diff().lt(0).values
    cnt = 0
    starting_value = None
    first_time = True
    result = []
    for i in range(len(is_dec)):
        if is_dec[i]:
            cnt += 1
            if first_time:
                # the fix: a run starts at the row *before* its first drop,
                # so take the starting value from there
                starting_value = df["value"].values[i - 1]
                first_time = False
        else:
            if cnt > 0:
                percent = round((df["value"].values[i - 1] / starting_value) * 100., 1)
                result.append((cnt + 1, percent))
                cnt = 0
                first_time = True
    result_df = pd.DataFrame.from_records(result, columns=['count', 'percentage'])
    return result_df[result_df["count"] >= min_count]


times = ['11:55', '12:00', '12:05', '12:10', '12:15', '12:20', '12:25', '12:30', '12:35', '12:40', '12:45', '12:50',
         '12:55', '13:00', '13:05', '13:10', '13:15', '13:20', '13:25', '13:30']
values = [0.940353, 0.999144, 0.909454, 0.904968, 0.867957, 0.801426, 0.794733, 0.770106, 0.741985, 0.671444, 0.558297,
          0.496972, 0.457803, 0.446388, 0.430217, 0.379902, 0.321828, 0.298304, 0.442079, 0.634764, ]

my_df = pd.DataFrame(data={'time': times, 'value': values})
print(get_decreasing_count_and_percentage(my_df))

values2 = [0.940353, 0.919144, 0.909454, 0.904968, 0.867957, 0.801426, 0.894733, 0.770106, 0.741985, 0.671444, 0.558297,
           0.496972, 0.457803, 0.446388, 0.430217, 0.379902, 0.321828, 0.298304, 0.442079, 0.634764]
my_df2 = pd.DataFrame(data={'time': times, 'value': values2})
print(get_decreasing_count_and_percentage(my_df2))
print(get_decreasing_count_and_percentage(my_df2, min_count=6))
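
With this change, the first print should report a count of 17 with the corrected percentage of 29.9 (0.298304 / 0.999144 ≈ 29.9%); the values2 results are the same as in JAponte's answer.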
L.COE