8

I have a data frame structured like below:

   group  maybe_start  maybe_end
0    ABC        False      False
1    ABC         True      False
2    ABC        False      False
3    ABC        False      False
4    ABC         True      False
5    ABC        False      False
6    ABC        False       True
7    ABC        False      False
8    DEF        False      False
9    DEF        False      False
10   DEF         True      False
11   DEF        False      False
12   DEF        False       True
13   DEF        False      False
14   DEF        False      False
15   DEF        False       True
16   DEF         True      False
17   DEF        False      False
18   DEF        False       True

I need to create a separate column, let's say group2, that will note the group defined by the moments of start and end. Therefore, every group in group2 should start, whenever there's first True value in maybe_start column after previous maybe_end==True and end on the first occurrance of maybe_end==True after the start. In other words, we start a new value in group2 at maybe_start==True (in this example in row 1) and every next row of group2 will get the same value until there's occurance of maybe_end==True (here, in row 6). All of this needs to be done within groupby where groups are created based on the group column. Therefore, the expected output should look as follows:

   group  maybe_start  maybe_end  group2
0    ABC        False      False     NaN
1    ABC         True      False     1.0
2    ABC        False      False     1.0
3    ABC        False      False     1.0
4    ABC         True      False     1.0
5    ABC        False      False     1.0
6    ABC        False       True     1.0
7    ABC        False      False     NaN
0    DEF        False      False     NaN
1    DEF        False      False     NaN
2    DEF         True      False     1.0
3    DEF        False      False     1.0
4    DEF        False       True     1.0
5    DEF        False      False     NaN
6    DEF        False      False     NaN
7    DEF        False       True     NaN
8    DEF         True      False     2.0
9    DEF        False      False     2.0
10   DEF        False       True     2.0 

How can I achieve this in vectorised way in Pandas?

OCa
  • 298
  • 2
  • 13
Xaume
  • 293
  • 2
  • 16
  • 1
    There's a corner case which need to be clarified, when both `maybe_start` and `maybe_end` are `True` – Vitalizzare Aug 01 '23 at 15:11
  • As per answer below, if both start and end are true, there are two cases: 1) previous sequence ended before - then, the row should start a new group and end it in the same row, meaning only this particular row would have the new value of group id; 2) there was a start=true somewhere before the row, but it ends only in the current row - then the group should start at previous start=True and end in the current row – Xaume Aug 07 '23 at 11:22
  • @Xaume can you please check the answers provided ? – Amira Bedhiafi Aug 07 '23 at 17:45

5 Answers5

3
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['ABC'] * 8 + ['DEF'] * 11,
    'maybe_start': [False, True, False, False, True, False, False, False, False, False, True, False, False, False, False, False, True, False, False],
    'maybe_end': [False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, True, False, False, True]
})

def apply_func(df):
    df['group2'] = np.nan
    counter = 0
    started = False
    for idx, row in df.iterrows():
        if row['maybe_start'] and not started:
            counter += 1
            started = True
        if started:
            df.loc[idx, 'group2'] = counter
        if row['maybe_end']:
            started = False
    return df

df = df.groupby('group').apply(apply_func)

print(df)

enter image description here

Amira Bedhiafi
  • 8,088
  • 6
  • 24
  • 60
2

I'm posting another answer as StackOverflow won't me allow edit the old answer.


As stated, it's hard do vectorize this code, but you can try to speed up the computation using numba:

from numba import njit


@njit
def numba_fn(out, x_start, x_end):
    g = 1.0
    state = 0

    for i in range(len(out)):
        start = x_start[i]
        end = x_end[i]

        if start and end:
            if state:
                out[i] = g
                state = False
            else:
                out[i] = g
                g += 1
        elif start:
            out[i] = g
            state = True
        elif end:
            if state:
                out[i] = g
                state = False
                g += 1
        elif state:
            out[i] = g

def fn(x):
    x["group2"] = np.nan
    numba_fn(x["group2"].values, x["maybe_start"].values, x["maybe_end"].values)
    return x


print(df.groupby("group", group_keys=False).apply(fn))

Prints:

   group  maybe_start  maybe_end  group2
0    ABC        False      False     NaN
1    ABC         True      False     1.0
2    ABC        False      False     1.0
3    ABC        False      False     1.0
4    ABC         True      False     1.0
5    ABC        False      False     1.0
6    ABC        False       True     1.0
7    ABC        False      False     NaN
8    DEF        False      False     NaN
9    DEF        False      False     NaN
10   DEF         True      False     1.0
11   DEF        False      False     1.0
12   DEF        False       True     1.0
13   DEF        False      False     NaN
14   DEF        False      False     NaN
15   DEF        False       True     NaN
16   DEF         True      False     2.0
17   DEF        False      False     2.0
18   DEF        False      True      2.0
19   DEF        True       True      3.0
20   DEF        False      False     NaN
21   DEF        True       False     4.0
22   DEF        True       True      4.0
23   DEF        False      False     NaN

BUT The main bottleneck seems to be the pd.Groupby (it creates new dataframe for each group, which slows down the computation).

You can use np.bincount to simulate the .groupby() (assuming the dataframe is sorted by "group"):

d = np.bincount(
    pd.Categorical(df["group"]).codes,
)
d = [0, *d.cumsum()]

df["group2"] = np.nan

values_group2 = df["group2"].to_numpy()
values_start = df["maybe_start"].to_numpy()
values_end = df["maybe_end"].to_numpy()

for a, b in zip(d, d[1:]):
    numba_fn(values_group2[a:b], values_start[a:b], values_end[a:b])

print(df)

A benchmark (with dataframe of 1000 groups with each group has 1000 elements):

from timeit import timeit

import numpy as np
import pandas as pd
from numba import njit


@njit
def numba_fn(out, x_start, x_end):
    g = 1.0
    state = 0

    for i in range(len(out)):
        start = x_start[i]
        end = x_end[i]

        if start and end:
            if state:
                out[i] = g
                state = False
            else:
                out[i] = g
                g += 1
        elif start:
            out[i] = g
            state = True
        elif end:
            if state:
                out[i] = g
                state = False
                g += 1
        elif state:
            out[i] = g


def normal_fn(x):
    out, g, state = [], 1, False
    for start, end in zip(x.maybe_start, x.maybe_end):
        if start and end:
            if state:
                out.append(g)
                state = False
            else:
                out.append(g)
                g += 1
        elif start:
            out.append(g)
            state = True
        elif end:
            if state:
                out.append(g)
                state = False
                g += 1
            else:
                out.append(np.nan)
        elif state:
            out.append(g)
        else:
            out.append(np.nan)

    x["group2"] = out
    return x


def fn(x):
    x["group2"] = np.nan
    numba_fn(x["group2"].values, x["maybe_start"].values, x["maybe_end"].values)
    return x


def test_groupby_normal_fn(df):
    return df.groupby("group", group_keys=False).apply(normal_fn)


def test_groupby_numba_fn(df):
    return df.groupby("group", group_keys=False).apply(fn)


def test_numpy_groupby_numba_fn(df):
    d = np.bincount(
        pd.Categorical(df["group"]).codes,
    )
    d = [0, *d.cumsum()]

    df["group2"] = np.nan

    values_group2 = df["group2"].to_numpy()
    values_start = df["maybe_start"].to_numpy()
    values_end = df["maybe_end"].to_numpy()

    for a, b in zip(d, d[1:]):
        numba_fn(values_group2[a:b], values_start[a:b], values_end[a:b])

    return df


def generate_df(num_groups=1000, elements_in_group=1000):
    from random import randint, seed

    seed(42)

    out = []
    for g in range(num_groups):
        for _ in range(elements_in_group):
            out.append((str(g), bool(randint(0, 1)), bool(randint(0, 1))))

    return pd.DataFrame(out, columns=["group", "maybe_start", "maybe_end"])


df = generate_df()

# test if the algorithm is correct:

df1 = test_groupby_normal_fn(df.copy())
df2 = test_groupby_numba_fn(df.copy())
df3 = test_numpy_groupby_numba_fn(df.copy())

np.testing.assert_equal(df1["group2"].values, df2["group2"].values)
np.testing.assert_equal(df1["group2"].values, df3["group2"].values)

t1 = timeit(
    "test_groupby_normal_fn(x)", setup="x=df.copy()", number=1, globals=globals()
)
t2 = timeit(
    "test_groupby_numba_fn(x)", setup="x=df.copy()", number=1, globals=globals()
)
t3 = timeit(
    "test_numpy_groupby_numba_fn(x)", setup="x=df.copy()", number=1, globals=globals()
)

print("test_groupby_normal_fn =", t1)
print("test_groupby_numba_fn =", t2)
print("test_numpy_groupby_numba_fn =", t3)

Prints on my machine (AMD 5700x, python==3.11.4, pandas=2.0.3, numpy==1.24.4, numba==0.57.1):

test_groupby_normal_fn = 0.47854723408818245
test_groupby_numba_fn = 0.30540501209907234
test_numpy_groupby_numba_fn = 0.03548573097214103

The numpy groupby + numba JIT is ~14x faster than normal pd.Groupby + .apply

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • @Andrey Kesely.... In your solution for the generated dataframe. I think row index 12 show be a new group. Since row index 11 starts and stops and is two 2, Row index 12 should be 3 in my opinion. – Scott Boston Aug 02 '23 at 04:10
  • @ScottBoston I think the results are OK (there could be multiple starts, but the group ends only if the end is True) That's why it's hard to vectorize (there is this state variable that the values depend on and computation must be done via loopy code). – Andrej Kesely Aug 02 '23 at 08:20
  • @ScottBoston Or If you mean if both start/end is True? Yeah, that OP must clarify (if this situation ever happens, and what should happen) – Andrej Kesely Aug 02 '23 at 08:25
  • @AndrejKesely if both start and end are true, there are two cases: 1) previous sequence ended before - then, the row should start a new group and end it in the same row, meaning only this particular row would have the new value of group id; 2) there was a start=true somewhere before the row, but it ends only in the current row - then the group should start at previous start=True and end in the current row – Xaume Aug 05 '23 at 08:27
  • @Xaume I've edited my answer (added some additional `if`-logic). But the performance is the same: `numpy` + `numba` has great benefit here. – Andrej Kesely Aug 05 '23 at 10:00
  • 1
    @ScottBoston I've edited the logic based on OP comments. – Andrej Kesely Aug 05 '23 at 10:01
1

You can try:

def fn(x):
    out, g, state = [], 1, False
    for start, end in zip(x.maybe_start, x.maybe_end):
        if not state and start:
            out.append(g)
            state = True
        elif state and end:
            out.append(g)
            state = False
            g += 1
        elif state:
            out.append(g)
        else:
            out.append(np.nan)

    x['group2'] = out
    return x


out = df.groupby('group', group_keys=False).apply(fn)
print(out)

Prints:

   group  maybe_start  maybe_end  group2
0    ABC        False      False     NaN
1    ABC         True      False     1.0
2    ABC        False      False     1.0
3    ABC        False      False     1.0
4    ABC         True      False     1.0
5    ABC        False      False     1.0
6    ABC        False       True     1.0
7    ABC        False      False     NaN
8    DEF        False      False     NaN
9    DEF        False      False     NaN
10   DEF         True      False     1.0
11   DEF        False      False     1.0
12   DEF        False       True     1.0
13   DEF        False      False     NaN
14   DEF        False      False     NaN
15   DEF        False       True     NaN
16   DEF         True      False     2.0
17   DEF        False      False     2.0
18   DEF        False       True     2.0
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thanks. However, my original dataset is quite large and for this reason, I need to do it in a vectorised operation due to performance. – Xaume Jul 29 '23 at 14:39
  • @Xaume I've posted another answer (SO won't me allow to edit existing answer). Using numba + numpy for groupby yield pretty significant results. – Andrej Kesely Aug 01 '23 at 22:23
1

IMO the process in this case hardly can be fully vectorised, because the calculated value depends not only on the record itself but also on a varying number of neighbors. What we can do though is try to minimize the number of total comparison, which can help in case of very long and sparse sequences.

To start with, let's drop group and work only with maybe_start, maybe_end columns:

import pandas as pd

data = {
    'maybe_start': [0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0],
    'maybe_end'  : [1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1]
}
df = pd.DataFrame(data, dtype=bool)

Now, the main idea is to split the index in intervals which can be assigned to a constant value, sort of:

for mark, (start, stop) in enumerate(intervals, start=1):
    marks[start:stop] = mark

So let's get these intervals, assuming df.index is of type RangeIndex.

# extract all maybe_start points including the very last index
# which is needed to build a correct sequence of intervals
# between maybe_start points
start = df.index[df['maybe_start'] | pd.Series(True, df.index[-1:])]

# extract all maybe_end points
stop = df.index[df['maybe_end']]

Next, we group stop points by intervals between start:

intervals = (
    stop
    .to_series()
    .groupby(pd.cut(stop, start))
    # get maybe_end which is closest to the left end of the current interval
    # I guess we could use here .first() as well becaouse the index is sorted
    .min()
    # expand endpoints to the previous empty ranges
    .backfill()
)

# keep only left ends of intervals in index
# Note: all categories are used as index
#    and the index is sorted at this point,
#    that's why the following step is possible
intervals.index = intervals.index.categories.left

# get DataFrame({'first':..., 'last':...})
# where each record contains indexes of closed intervals
# which is going to be marked by the record's index
intervals = (
    intervals
    .drop_duplicates()   # keep the first by default (!)
    # values in this series are the right ends of intervals to mark
    .rename('last')      
    # indexes in the series are the left ends of intervals to mark 
    .rename_axis(index='first')
    .reset_index()
)

# start indexing of the final intervals from 1
intervals.index += 1

Here's what we get at this point on our test data:

intervals

With this, we're ready to mark up the data:

df['marks'] = float('nan')

for mark, (first, last) in intervals.iterrows():
    df.loc[first:last, 'marks'] = mark

And here's what we have so far:

marked dataframe

Now let's take into account the group column of the original data:

from io import StrigIO

data = '''index   group  maybe_start  maybe_end
0    ABC        False      False
1    ABC         True      False
2    ABC        False      False
3    ABC        False      False
4    ABC         True      False
5    ABC        False      False
6    ABC        False       True
7    ABC        False      False
8    DEF        False      False
9    DEF        False      False
10   DEF         True      False
11   DEF        False      False
12   DEF        False       True
13   DEF        False      False
14   DEF        False      False
15   DEF        False       True
16   DEF         True      False
17   DEF        False      False
18   DEF        False       True
'''
original_df = pd.read_csv(StringIO(data), delim_whitespace=True, index_col=0)
# let's make group values not sorted in test_df
test_df = pd.concat([original_df, original_df]).reset_index(drop=True)

def markup(df: pd.DataFrame) -> pd.Series:
    start = df.index[df['maybe_start'] | pd.Series(True, df.index[-1:])]
    stop = df.index[df['maybe_end']]
    intervals = (
        stop
        .to_series()
        .groupby(pd.cut(stop, start))
        .min()
        .backfill()
    )
    intervals.index = intervals.index.categories.left
    intervals = (
        intervals
        .drop_duplicates()
        .rename('last')      
        .rename_axis(index='first')
        .reset_index()
    )
    intervals.index += 1
    df['marks'] = float('nan')
    for mark, (first, last) in intervals.iterrows():
        df.loc[first:last, 'marks'] = mark
    return df

marked_test = (
    test_df
    .groupby('group', group_keys=False)
    .apply(markup)
)

And here's the final output:

marked original data


python         : 3.11.0
pandas         : 1.5.1
Vitalizzare
  • 4,496
  • 7
  • 13
  • 32
1

2 steps: validate presence, then assign group number.

Presence

  • Function next_true() allows to jump back and forth between actually used rows of 'maybe_start' and 'maybe_end', flagging actual starts and times of presence, in the process.
  • Neither groupby nor full vectorization here because the dependence on history thwarted my attempts. At least, the while loop doesn't iterate over the whole index.
# 1) Tools for readability

def next_true(df, col):
    '''
    Index of earliest next True in given column
    - df is a dataframe,
    - col is a column name
    '''
    return min(df.loc[df[col]].index)

def cumsumreset(signal_sum, signal_reset):
    '''
    Cumulative sum of signal_sum that resets whenever signal_reset hits False.
    signal_sum, signal_reset are dataframe columns
    '''
    return signal_sum.cumsum() - signal_sum.cumsum().where(~signal_reset).ffill().fillna(0).astype(int)

# 2) Identify actual starts and presence

df[['actual_start','present']] = False
j=0
while j < len(df)-1:
    i = next_true(df.iloc[j:],'maybe_start')
    j = next_true(df.iloc[i:],'maybe_end')
    df.loc[i  ,'actual_start'] = True
    df.loc[i:j,'present']      = True

Grouping

# 3) Flag group change (ABC --> DEF --> ...)

df['turnover'] = df['group']!=df['group'].shift(-1)#True whenever the group is *about to* change. 
df.loc[len(df)-1,'turnover'] = False# fix last row being True because of .shift () producing NaN.

# 4) Assign group number (based on cumulative sum resetting upon group change)
df['group2'] = cumsumreset(df['actual_start'], ~df['turnover'])

# 5) Finally remove values that need deleting (no presence)
df.loc[(~df['present']),'group2']=None

In the full output presented below, 'group2' is identical to 'expected' from OP.

   group  maybe_start  maybe_end  expected  actual_start  present  turnover  group2
0    ABC        False      False       NaN         False    False     False     NaN
1    ABC         True      False       1.0          True     True     False     1.0
2    ABC        False      False       1.0         False     True     False     1.0
3    ABC        False      False       1.0         False     True     False     1.0
4    ABC         True      False       1.0         False     True     False     1.0
5    ABC        False      False       1.0         False     True     False     1.0
6    ABC        False       True       1.0         False     True     False     1.0
7    ABC        False      False       NaN         False    False      True     NaN
8    DEF        False      False       NaN         False    False     False     NaN
9    DEF        False      False       NaN         False    False     False     NaN
10   DEF         True      False       1.0          True     True     False     1.0
11   DEF        False      False       1.0         False     True     False     1.0
12   DEF        False       True       1.0         False     True     False     1.0
13   DEF        False      False       NaN         False    False     False     NaN
14   DEF        False      False       NaN         False    False     False     NaN
15   DEF        False       True       NaN         False    False     False     NaN
16   DEF         True      False       2.0          True     True     False     2.0
17   DEF        False      False       2.0         False     True     False     2.0
18   DEF        False       True       2.0         False     True     False     2.0

@ScottBoston: Copying here below, what I get with your MCVE. Group2 integers are on top and NaNs at the bottom.

   group  maybe_start  maybe_end     ng  ngg     g2  group2
0    ABC        False      False  False    1  False     1.0
1    ABC         True      False   True    1   True     1.0
2    ABC        False      False   True    1   True     1.0
3    ABC        False      False   True    1   True     1.0
4    ABC         True      False   True    1   True     1.0
5    ABC        False      False   True    1   True     1.0
6    ABC        False       True   True    1   True     1.0
7    ABC        False      False  False    2  False     1.0
8    DEF        False      False  False    3  False     1.0
9    DEF        False      False  False    4  False     2.0
10   DEF         True      False   True    4   True     2.0
11   DEF        False      False   True    4   True     2.0
12   DEF        False       True   True    4   True     NaN
13   DEF        False      False  False    5  False     NaN
14   DEF        False      False   True    5  False     NaN
15   DEF        False       True   True    5  False     NaN
16   DEF         True      False  False    6   True     NaN
17   DEF        False      False   True    6   True     NaN
18   DEF        False       True   True    6   True     NaN

It comes with

FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

    >>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use 

    >>> .groupby(..., group_keys=True)
  df[df["g2"]]

Following this warning, correct MCVE output is obtained using, as the last groupby: (i.e. inserting ,group_keys=True)

df["group2"] = (
        df[df["g2"]]
        .groupby("group",group_keys=True)["ngg"]
        .apply(lambda x: x.astype("category").cat.codes + 1)
        .reset_index(level=0, drop=True)
    )

Are we operating from too different versions? My pandas version is

pd.__version__
'1.5.3'
OCa
  • 298
  • 2
  • 13
  • 1
    Ah.. Yes, in 1.5.3 I fixed the problem. Replace this line. `df.loc[df["g2"], "group2"] = ( df[df["g2"]] .groupby("group", group_keys=False)["ngg"] .apply(lambda x: x.astype("category").cat.codes + 1) .reset_index(level=0, drop=True) )` – Scott Boston Aug 04 '23 at 16:40
  • In 1.5.3, I needed to lineup the indexing explicitly using .loc on the left of the equals sign. – Scott Boston Aug 04 '23 at 16:41