changing a cell value based on another cells value in a dataframe

Question

I'm trying to determine how many days in a row a certain column is above or below a certain threshold.

>>> df.head()
            Open   High    Low  Close  Volume
Date
2004-08-19  49.96  51.98  47.93  50.12     NaN
2004-08-20  50.69  54.49  50.20  54.10     NaN
2004-08-23  55.32  56.68  54.47  54.65     NaN
2004-08-24  55.56  55.74  51.73  52.38     NaN
2004-08-25  52.43  53.95  51.89  52.95     NaN
>>>

For the example above, I'd like another column df['RDA'] to increment for each consecutive day that the column Open is over 50. For each consecutive day below 50, I'd like a second column df['RDB'] to increment and df['RDA'] to reset to 0. I've tried if/then logic but it doesn't like that and gives me a value error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). how can i sort it out

I'd like the output of my dataframe to look like this:

>>> df.head()
            Open   High    Low  Close  Volume    RDA   RDB
Date
2004-08-19  51.96  51.98  47.93  50.12     NaN    1      0
2004-08-20  50.69  54.49  50.20  54.10     NaN    2      0
2004-08-23  55.32  56.68  54.47  54.65     NaN    3      0
2004-08-24  45.56  55.74  51.73  52.38     NaN    0      1
2004-08-25  42.43  53.95  51.89  52.95     NaN    0      2
2004-08-26  41.96  51.98  47.93  50.12     NaN    0      3
2004-08-27  40.69  54.49  50.20  54.10     NaN    0      4
2004-08-28  55.32  56.68  54.47  54.65     NaN    1      0
2004-08-29  55.56  55.74  51.73  52.38     NaN    2      0
2004-08-30  52.43  53.95  51.89  52.95     NaN    3      0
>>>

Is this something that's possible with Pandas? I know you can get a count of the values in a column, but I've been so far unable to find a method for with consecutive values. An if/then statement with 2 variables would work, but like I mentioned above, I get a value error when I tried that. Any help would be appreciated.

piRSquared · Answer 1 · 2017-08-09T01:56:17.643

2

I'll use np.sign on the difference between Open and 50. It will be -1 when less than 50, 0 when exactly 50, and 1 when greater than 50.
Next I'll use np.diff to identify when it switches from one value to another
Then I'll use cumsum to define the groups of consecutive signs
Next I'll use cumcount to get counts within groups
Finally I'll use np.where to split the cumcounts

o = df.Open.values - 50
signs = np.sign(o)
changes = np.append(False, signs[:-1] != signs[1:])
g = changes.cumsum()
cumcounts = df.groupby(g).cumcount() + 1

a = np.where(signs == 1,  cumcounts, 0)
b = np.where(signs == -1, cumcounts, 0)

df.assign(RDA=a, RDB=b)

             Open   High    Low  Close  Volume  RDA  RDB
Date                                                    
2004-08-19  51.96  51.98  47.93  50.12     NaN    1    0
2004-08-20  50.69  54.49  50.20  54.10     NaN    2    0
2004-08-23  55.32  56.68  54.47  54.65     NaN    3    0
2004-08-24  45.56  55.74  51.73  52.38     NaN    0    1
2004-08-25  42.43  53.95  51.89  52.95     NaN    0    2
2004-08-26  41.96  51.98  47.93  50.12     NaN    0    3
2004-08-27  40.69  54.49  50.20  54.10     NaN    0    4
2004-08-28  55.32  56.68  54.47  54.65     NaN    1    0
2004-08-29  55.56  55.74  51.73  52.38     NaN    2    0
2004-08-30  52.43  53.95  51.89  52.95     NaN    3    0

edited Aug 09 '17 at 01:56

answered Aug 09 '17 at 00:21

piRSquared

285,575
57
475
624

Can you provide an explanation with this? – Cory Madden Aug 09 '17 at 00:27
@CoryMadden I answered this off top of my head trying to hurry to make a plane. I don't even know if it works. I'll try to add some explanation when I can. It may be a bit now and a bit later. – piRSquared Aug 09 '17 at 00:34
@Alexander I'll edit it when I can get on my laptop. Thx for the heads up – piRSquared Aug 09 '17 at 00:42
Definitely interesting. Thanks for adding the explanation. – Cory Madden Aug 09 '17 at 00:44
@Alexander it works now. I had to add one to cumcount. – piRSquared Aug 09 '17 at 01:56
Great! Do you have time for timings? – Alexander Aug 09 '17 at 02:06
@Alexander unfortunately, I won't be on for the rest of the night. – piRSquared Aug 09 '17 at 02:23

Alexander · Accepted Answer · 2017-08-09T01:26:22.850

First, add a flag column to the dataframe to indicate if the Open is higher than the target price of 50 (True or False).

Then, you can use the compare-cumsum-groupby pattern to identify cumulative groupings of this flag and apply a cumsum to each such group.

We now need to invert the flag so that ones are zeroes and zeroes are ones and then use the same strategy to calculate rdb.

Finally, we drop the flag column (I used .iloc[:, :-1] to drop it since I added it as the last column) and append the new RDA and RDB columns.

target_price = 50
df = df.assign(flag=df.Open.gt(target_price))  # True if `Open` greater than `target_price`, otherwise False.

rda = df.groupby((df['flag'] != df['flag'].shift()).cumsum()).flag.cumsum()
df['flag'] = ~df['flag']  # Invert flag for RDB.
rdb = df.groupby((df['flag'] != df['flag'].shift()).cumsum()).flag.cumsum()

df = df.iloc[:, :-1].assign(RDA=rda, RDB=rdb)
>>> df
      Date   Open   High    Low  Close  Volume  RDA  RDB
0  8/19/04  51.96  51.98  47.93  50.12     NaN    1    0
1  8/20/04  50.69  54.49  50.20  54.10     NaN    2    0
2  8/23/04  55.32  56.68  54.47  54.65     NaN    3    0
3  8/24/04  45.56  55.74  51.73  52.38     NaN    0    1
4  8/25/04  42.43  53.95  51.89  52.95     NaN    0    2
5  8/26/04  41.96  51.98  47.93  50.12     NaN    0    3
6  8/27/04  40.69  54.49  50.20  54.10     NaN    0    4
7  8/28/04  55.32  56.68  54.47  54.65     NaN    1    0
8  8/29/04  55.56  55.74  51.73  52.38     NaN    2    0
9  8/30/04  52.43  53.95  51.89  52.95     NaN    3    0

Jorden · Answer 3 · 2017-08-09T01:36:45.930

This can also be done using the Python provided functools.reduce method. First create an iterable of your target data, so in your case:

target = df.Open > 50

This will be what you pass into functools.reduce later on to be "reduced". Reduce is essentially map, but retains a value across list elements. This can be leveraged to do what you are asking.

I'll try and break down the function you could use (shown in full at the end of the post).

functools.reduce gives you access to two parameters. Your accumulated value, and the current list item you are at. It also allows you to pass in your own initializer (first item before anything is looked at). With this we can go through our list, and if it is True, as determined by our target series above, we can add 1 to the last element in the list, otherwise add a 0 to our accumulator.

This requires a little bit of finessing by setting the initializer to a list with the value 0 in it like [0], so that on the first pass it can take the "last" element and do something with it without erring out.

Once completed there will be that straggler 0 at the front of the list which you can just remove using a slice [1:] to take only the second element and onwards.

Your RDB column is the exact same, except you want to make sure it is NOT True in your target list, which just requires adding a not in your conditional statement.

The full code would look like this:

import functools

# Create a boolean series of your Open column
target = df.Open > 50

# For every item in your boolean series add a 1 to the previous value if it's over 50, otherwise reset
df['RDA'] = functools.reduce(lambda x, y: x + ([x[-1] + 1] if y else [0]), target, [0])[1:]
# Repeat, but for every `False` value in the series
df['RDB'] = functools.reduce(lambda x, y: x + ([x[-1] + 1] if not y else [0]), target, [0])[1:]

>>> df.head()
            Open    High    Low Close   Volume  RDA RDB
Date                            
2004-08-19  49.96   51.98   47.93   50.12   NaN 0   1
2004-08-20  50.69   54.49   50.20   54.10   NaN 1   0
2004-08-23  55.32   56.68   54.47   54.65   NaN 2   0
2004-08-24  55.56   55.74   51.73   52.38   NaN 3   0
2004-08-25  52.43   53.95   51.89   52.95   NaN 4   0

Anirudh Bandi · Answer 4 · 2017-08-09T00:31:57.090

def fun(d):
    j = 0
    flag = 2
    rda = []
    rdb = []
    for i in range(d.shape[0]):
        if d.loc[i,'Open'] >= 50:
            if flag == 1:
                j = 0
            j = j + 1
            rda.append(j)
            rdb.append(0)
            flag = 0
        else:
            if flag == 0:
                j=0
            j = j + 1
            rdb.append(j)
            rda.append(0)
            flag = 1

    return rda,rdb

df['RDA'],df['RDB'] = fun(df)

Cory Madden · Answer 5 · 2017-08-09T00:46:54.410

I originally misunderstood and didn't realize you wanted to restart the count whenever it changed. I think the best way to do this would be to use DataFrame.apply across the rows like this:

In [226]: def increment(row):
     ...:     global rda
     ...:     global rdb
     ...:     if row.Open > 50:
     ...:         row.RDA = int(next(rda))
     ...:         rdb = count()
     ...:     else:
     ...:         row.RDB = next(rdb)
     ...:         rda = int(count())
     ...:     return row
In [227]: df['RDA'] = 0
In [228]: df['RDB'] = 0
In [229]: df.apply(increment, axis=1)
             Open   High    Low  Close  Volume  RDA  RDB
Date                                                    
2004-08-19  49.96  51.98  47.93  50.12     NaN  0.0  1.0
2004-08-20  50.69  54.49  50.20  54.10     NaN  0.0  0.0
2004-08-23  55.32  56.68  54.47  54.65     NaN  1.0  0.0
2004-08-24  55.56  55.74  51.73  52.38     NaN  2.0  0.0
2004-08-25  52.43  53.95  51.89  52.95     NaN  3.0  0.0

I don't know why they're showing up as floats in the columns, I guess pandas thinks that's what you want. The data originally comes from count as an int. And I'm not normally a fan of globals, but DataFrame.apply couldn't access the variables when they were outside the increment function.

changing a cell value based on another cells value in a dataframe

5 Answers5