
Consider the following data frame:

                                     Value
time
2020-02-14 14:16:10.769999872+00:00     74
2020-02-14 14:16:11.360999936+00:00     74
2020-02-14 14:16:11.970000128+00:00     72
2020-02-14 14:16:12.637000192+00:00     72
2020-02-14 14:16:13.210000128+00:00     74
...                                    ...
2020-02-28 08:15:20.340000+00:00        71
2020-02-28 08:15:20.890000128+00:00     71
2020-02-28 08:15:21.424000+00:00        71
2020-02-28 08:15:22.032999936+00:00     72
2020-02-28 08:15:22.594000128+00:00     72

I would like my code to go through the Values, find the start and end index of each value, and save this information in a dictionary:

results = {74: {start:2020-02-14 14:16:10.769999872+00:00, end:2020-02-14 14:16:11.360999936+00:00}, 
           72: {start: ..., end: ...},
           ...}

Because this would be too simple, the tricky part is that one or more values may appear multiple times in a non-consecutive way: 74, 74, 72, 72, 72, 74, 74, 74, 71, 71, 71, 72, 72, 71, 71.

If this is the case, then for each Value a new sequence should be generated that contains the start and end index.

results = {74:
               {Sequence1: {start:2020-02-14 14:16:10.769999872+00:00, end:2020-02-14 14:16:11.360999936+00:00},
                Sequence2: {start: ... , end: ...}},
           72: 
               {Sequence1: {start: ..., end: ...},
                Sequence2: {start: ..., end: ...},
                Sequence3: {start: ..., end: ...}},
          71: ...,
          }

Naturally I could code this with lots of for-loops, but I was wondering whether there might be a neater, more clever solution that could spare me the faff. And maybe most important of all, it is crucial that the code runs fast: the data frame has around 300,000 rows.

Henry Ecker

2 Answers


This can be done in two parts: the first is finding the consecutive groups, the second is finding the minimum/maximum time of each group.

To find the groups you can use the classic compare-with-shift-and-cumsum approach for identifying consecutive runs. Here it is applied to your case:

groups = (df.Value != df.Value.shift()).cumsum()
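
For instance, on the sample Values below the comparison flags every row where the value changes, and the cumulative sum turns those flags into run ids:

# Value:    74  74  72  72  74  71  71  71  72  72
# changed:   T   F   T   F   T   T   F   F   T   F   <- df.Value != df.Value.shift()
# groups:    1   1   2   2   3   4   4   4   5   5   <- .cumsum()

Each run of equal values gets its own id, so grouping by groups yields exactly one group per run.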

Then you could just apply multiple groupby operations to find the start and end dates. However, there is a more efficient and straightforward way to do that using named aggregation with agg:

result = df.groupby(groups).agg(Value=('Value', 'first'),
                                startTime=('time', 'min'),
                                endTime=('time', 'max'))

Finally, if you want a dict, you can just iterate over the resulting dataframe.
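
As a minimal sketch (assuming the result frame produced by the agg call above), the nested dict with numbered Sequence keys could be built like this:

# Number each run per Value (Sequence1, Sequence2, ...), then nest per Value
result = result.reset_index(drop=True)
result['Seq'] = 'Sequence' + (result.groupby('Value').cumcount() + 1).astype(str)
final = {val: grp.set_index('Seq')[['startTime', 'endTime']].to_dict(orient='index')
         for val, grp in result.groupby('Value')}

The inner keys follow the column names (startTime/endTime); rename the columns first if you want start/end exactly as in the question.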

Here is the tested input:

                                  time  Value
0  2020-02-14 14:16:10.769999872+00:00     74
1  2020-02-14 14:16:11.360999936+00:00     74
2  2020-02-14 14:16:11.970000128+00:00     72
3  2020-02-14 14:16:12.637000192+00:00     72
4  2020-02-14 14:16:13.210000128+00:00     74
5     2020-02-28 08:15:20.340000+00:00     71
6  2020-02-28 08:15:20.890000128+00:00     71
7     2020-02-28 08:15:21.424000+00:00     71
8  2020-02-28 08:15:22.032999936+00:00     72
9  2020-02-28 08:15:22.594000128+00:00     72

Here is the output:

       Value                            startTime                              endTime
Value                                                                                 
1         74  2020-02-14 14:16:10.769999872+00:00  2020-02-14 14:16:11.360999936+00:00
2         72  2020-02-14 14:16:11.970000128+00:00  2020-02-14 14:16:12.637000192+00:00
3         74  2020-02-14 14:16:13.210000128+00:00  2020-02-14 14:16:13.210000128+00:00
4         71     2020-02-28 08:15:20.340000+00:00     2020-02-28 08:15:21.424000+00:00
5         72  2020-02-28 08:15:22.032999936+00:00  2020-02-28 08:15:22.594000128+00:00

Note that I tested with input dates encoded as strings, which should be fine since they are expressed according to ISO 8601, so (with a uniform UTC offset) their lexicographic order matches chronological order and min/max still behave correctly.
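
If actual Timestamp objects are preferred in the result, the strings could be parsed first, e.g.:

df['time'] = pd.to_datetime(df['time'])   # parse the ISO 8601 strings into tz-aware Timestamps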

Jérôme Richard

I assume that the index is actually a DatetimeIndex. If it isn't, convert it.
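
A possible conversion, assuming the index currently holds ISO 8601 strings and pandas is imported as pd:

df.index = pd.to_datetime(df.index)   # make the index a proper DatetimeIndex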

To do your task, start by defining a function to be applied to each group of rows:

import pandas as pd

def fn(grp):
    tMin = grp.index.min()   # start time of this run
    tMax = grp.index.max()   # end time of this run
    v = grp.Value.iloc[0]    # the run's (constant) Value
    return pd.Series([v, tMin, tMax], index=['val', 'start', 'end'])

Then apply it to each group of rows with equal Value (a change in Value opens a new group):

df2 = (df.groupby((df.Value != df.Value.shift()).cumsum())
         .apply(fn).reset_index(drop=True))

The next step is to generate a Seq column with the Sequence... labels (first just a number, then converted into a string):

df2['Seq'] = df2.groupby('val').cumcount() + 1
df2['Seq'] = 'Sequence' + df2['Seq'].astype(str)
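
For the sample data, df2 should now look roughly like this:

   val                               start                                 end        Seq
0   74 2020-02-14 14:16:10.769999872+00:00 2020-02-14 14:16:11.360999936+00:00  Sequence1
1   72 2020-02-14 14:16:11.970000128+00:00 2020-02-14 14:16:12.637000192+00:00  Sequence1
2   74 2020-02-14 14:16:13.210000128+00:00 2020-02-14 14:16:13.210000128+00:00  Sequence2
3   71    2020-02-28 08:15:20.340000+00:00    2020-02-28 08:15:21.424000+00:00  Sequence1
4   72 2020-02-28 08:15:22.032999936+00:00 2020-02-28 08:15:22.594000128+00:00  Sequence2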

And to compute the final result, run:

result = {}
for key, grp in df2.groupby('val'):
    result[key] = grp.set_index('Seq')[['start', 'end']].to_dict(orient='index')

For your sample data, the result is:

{71: {'Sequence1': {'start': Timestamp('2020-02-28 08:15:20.340000+0000', tz='UTC'),
   'end': Timestamp('2020-02-28 08:15:21.424000+0000', tz='UTC')}},
 72: {'Sequence1': {'start': Timestamp('2020-02-14 14:16:11.970000128+0000', tz='UTC'),
   'end': Timestamp('2020-02-14 14:16:12.637000192+0000', tz='UTC')},
  'Sequence2': {'start': Timestamp('2020-02-28 08:15:22.032999936+0000', tz='UTC'),
   'end': Timestamp('2020-02-28 08:15:22.594000128+0000', tz='UTC')}},
 74: {'Sequence1': {'start': Timestamp('2020-02-14 14:16:10.769999872+0000', tz='UTC'),
   'end': Timestamp('2020-02-14 14:16:11.360999936+0000', tz='UTC')},
  'Sequence2': {'start': Timestamp('2020-02-14 14:16:13.210000128+0000', tz='UTC'),
   'end': Timestamp('2020-02-14 14:16:13.210000128+0000', tz='UTC')}}}

Note that each value saved under a start or end key is an actual Timestamp. It could also be a plain string, but I think Timestamps are easier for any further processing.

Valdi_Bo