Pandas GroupBy time idxmax w/ empty groups throws exception

Question

I have a list of several million events in time stored in a data frame df. df.head() gives output:

                             rel_temp
timestamp                        
2016-04-04 10:41:07.663457   0.02
2016-04-04 11:05:44.382078  -0.05
2016-04-04 11:42:08.777205   0.01
2016-04-04 11:44:54.194832   0.08
2016-04-04 11:49:48.349108  -0.02

I am grouping by time.

grouper = pd.Grouper(freq='10T')
grouped_df = df.groupby(grouper)

I want the idxmin and idxmax for every time window, because I need to know whether the minimum or the maximum came first. Any time window that contains no measurements should have a NaN value.

Thus, for the df above, I want grouped_df['rel_temp'].idxmax() to give this output:

timestamp
2016-04-04 10:40:00        0
2016-04-04 10:50:00      NaN
2016-04-04 11:00:00        0
2016-04-04 11:10:00      NaN
2016-04-04 11:20:00      NaN
2016-04-04 11:30:00      NaN
2016-04-04 11:40:00        1
2016-04-04 11:50:00      NaN

However, as far as I can tell, there is no way to use idxmax() or idxmin() when a groupby may contain empty groups. Doing so causes this exception:

File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 947, in wrapper
    return self._python_apply_general(curried, self._obj_with_exclusions)
  File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1287, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 820, in apply
    res = f(group)
  File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 936, in curried
    return f(x, *args, **kwargs)
  File "/.venv/lib/python3.9/site-packages/pandas/core/series.py", line 2355, in idxmax
    i = self.argmax(axis, skipna, *args, **kwargs)
  File "/.venv/lib/python3.9/site-packages/pandas/core/base.py", line 647, in argmax
    return nanops.nanargmax(  # type: ignore[return-value]
  File "/.venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 93, in _f
    return f(*args, **kwargs)
  File "/.venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 1070, in nanargmax
    result = values.argmax(axis)  # type: ignore[var-annotated]
ValueError: attempt to get argmax of an empty sequence

I also can't find a way to select only the non-empty groups from grouped_df and then run idxmax() on just those groups. Ideally I could do something like this:

group_counts = grouped_df['rel_temp'].count()
group_not_empty = (group_counts > 0)
grouped_df_no_empty_groups = grouped_df[group_not_empty]
grouped_df_no_empty_groups['rel_temp'].idxmax()

However, I can't find a way to do this.

Edit: As per @andrej-kesely, this can be done with .apply(); however, that approach is extremely slow. I need to process these rows within 20 seconds, which should be possible if there were a direct (non-apply-based) method.

Any help is appreciated!

Andrej Kesely · Accepted Answer · 2022-12-12T00:47:43.097

If I understand you correctly, you want to find the index of the maximum value in each interval (specified by pd.Grouper):

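import numpy as np
import pandas as pd

grouper = pd.Grouper(freq="10T")

# Position of the maximum within each 10-minute window; NaN for empty windows.
df_out = df.groupby(grouper).apply(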
    lambda x: x["rel_temp"].reset_index(drop=True).idxmax()
    if len(x)
    else np.nan
)

print(df_out)

Prints:

timestamp
2016-04-04 10:40:00    0.0
2016-04-04 10:50:00    NaN
2016-04-04 11:00:00    0.0
2016-04-04 11:10:00    NaN
2016-04-04 11:20:00    NaN
2016-04-04 11:30:00    NaN
2016-04-04 11:40:00    1.0
Freq: 10T, dtype: float64

EDIT: Another version:

grouper = pd.Grouper(freq="10T")

t, v = [], []
for i, g in df.groupby(grouper):
    t.append(i)

    if len(g):
        v.append(max(enumerate(g["rel_temp"]), key=lambda k: k[1])[0])
    else:
        v.append(np.nan)

out = pd.DataFrame({"timestamp": t, "value": v})
print(out)

Prints:

            timestamp  value
0 2016-04-04 10:40:00    0.0
1 2016-04-04 10:50:00    NaN
2 2016-04-04 11:00:00    0.0
3 2016-04-04 11:10:00    NaN
4 2016-04-04 11:20:00    NaN
5 2016-04-04 11:30:00    NaN
6 2016-04-04 11:40:00    1.0
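
A possible alternative that avoids both .apply() and the Python loop, sketched under assumptions that are not part of the answer above: the index is sorted by time, rel_temp contains no NaN values, and windows are labeled by flooring each timestamp to 10 minutes (which should line up with pd.Grouper for frequencies that divide a day evenly). Only non-empty windows ever appear as group keys, so no empty group is reduced; the empty windows are added back afterwards with reindex. The column names window, pos, and neg are illustrative helpers. It has not been benchmarked against the 20-second target.

```python
import pandas as pd

# Sample data copied from df.head() in the question.
idx = pd.to_datetime([
    "2016-04-04 10:41:07.663457",
    "2016-04-04 11:05:44.382078",
    "2016-04-04 11:42:08.777205",
    "2016-04-04 11:44:54.194832",
    "2016-04-04 11:49:48.349108",
])
df = pd.DataFrame({"rel_temp": [0.02, -0.05, 0.01, 0.08, -0.02]}, index=idx)
df = df.rename_axis("timestamp")

tmp = df.reset_index()
# Label each row with the start of its 10-minute window (assumed helper column).
tmp["window"] = tmp["timestamp"].dt.floor("10min")
# Position of each row inside its window; relies on rows being in time order.
tmp["pos"] = tmp.groupby("window").cumcount()
# Negate so that an ascending stable sort puts each window's maximum first
# and keeps the earliest row when values tie.
tmp["neg"] = -tmp["rel_temp"]

best = tmp.sort_values("neg", kind="stable").drop_duplicates("window", keep="first")
result = best.set_index("window")["pos"].sort_index()

# Reinsert the empty windows as NaN.
full = pd.date_range(result.index.min(), result.index.max(), freq="10min")
result = result.reindex(full)
print(result)
```

Sorting on rel_temp directly (without the negation) would give the position of the minimum in the same way.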

edited Dec 12 '22 at 00:47

answered Dec 11 '22 at 23:30


I appreciate this solution, but after trying this it is _extremely slow_ as it uses .apply(). It takes over 5 minutes on a sample set of a couple million rows. I need to process sets of this size in under 20 seconds, so I need to use aggregations without .apply() – chazzmoney Dec 12 '22 at 00:35
@chazzmoney I've posted another version you can try. Also, I suggest running a profiler on your code to see where the main bottleneck is. – Andrej Kesely Dec 12 '22 at 00:48
I will try the new version, thanks. As for the bottleneck, it is the .apply() call itself. The inner portion runs in 0.0014 seconds, but it is called a couple hundred thousand times because of the mechanism of .apply() – chazzmoney Dec 12 '22 at 06:57
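
Since the goal stated in the question is only to know whether the minimum or the maximum came first, one option is to skip positions entirely and compare the timestamps returned by idxmax() and idxmin(). The sketch below is an assumption-laden illustration, not something proposed in the thread: it groups on floored timestamps instead of pd.Grouper, so empty windows never become groups, and it assumes rel_temp has no NaN values. Whether the groupby idxmax/idxmin calls are fast enough depends on the pandas version and has not been measured here.

```python
import pandas as pd

# Sample data copied from df.head() in the question.
idx = pd.to_datetime([
    "2016-04-04 10:41:07.663457",
    "2016-04-04 11:05:44.382078",
    "2016-04-04 11:42:08.777205",
    "2016-04-04 11:44:54.194832",
    "2016-04-04 11:49:48.349108",
])
df = pd.DataFrame({"rel_temp": [0.02, -0.05, 0.01, 0.08, -0.02]}, index=idx)
df = df.rename_axis("timestamp")

# Window start for every row; only windows that contain data show up as keys.
window = df.index.floor("10min")
grouped = df["rel_temp"].groupby(window)

t_max = grouped.idxmax()  # timestamp of the maximum in each non-empty window
t_min = grouped.idxmin()  # timestamp of the minimum in each non-empty window

# True where the maximum occurred strictly before the minimum.
# A window with a single row has t_max == t_min, so it yields False.
max_first = (t_max < t_min).astype("boolean")

# Add the empty windows back as missing values.
full = pd.date_range(window.min(), window.max(), freq="10min")
max_first = max_first.reindex(full)
print(max_first)
```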
