I have a list of several million events in time stored in a data frame df. df.head() gives this output:
rel_temp
timestamp
2016-04-04 10:41:07.663457 0.02
2016-04-04 11:05:44.382078 -0.05
2016-04-04 11:42:08.777205 0.01
2016-04-04 11:44:54.194832 0.08
2016-04-04 11:49:48.349108 -0.02
I am grouping by time, in 10-minute windows:
grouper = pd.Grouper(freq='10T')
grouped_df = df.groupby(grouper)
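For reference, here is a minimal sketch that rebuilds the five rows shown above and applies the same grouping, so the problem can be reproduced. The values are copied from the df.head() output; the frequency string "10min" is used as an equivalent of '10T', which newer pandas versions flag as deprecated. Counting per window shows which windows are empty.

```python
import pandas as pd

# Rebuild the sample rows from the df.head() output above.
df = pd.DataFrame(
    {"rel_temp": [0.02, -0.05, 0.01, 0.08, -0.02]},
    index=pd.DatetimeIndex(
        [
            "2016-04-04 10:41:07.663457",
            "2016-04-04 11:05:44.382078",
            "2016-04-04 11:42:08.777205",
            "2016-04-04 11:44:54.194832",
            "2016-04-04 11:49:48.349108",
        ],
        name="timestamp",
    ),
)

grouper = pd.Grouper(freq="10min")  # same windows as freq='10T'
grouped_df = df.groupby(grouper)

# count() is defined for empty windows and reports 0 for them.
group_counts = grouped_df["rel_temp"].count()
print(group_counts)
```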
I want to get the idxmin and idxmax for every time window, because I need to know whether the minimum or the maximum value came first. Any time window that contains no measurement should get a NaN value.
Thus, for the df above, I want grouped_df['rel_temp'].idxmax() to give this output:
timestamp
2016-04-04 10:40:00 0
2016-04-04 10:50:00 NaN
2016-04-04 11:00:00 0
2016-04-04 11:10:00 NaN
2016-04-04 11:20:00 NaN
2016-04-04 11:30:00 NaN
2016-04-04 11:40:00 1
2016-04-04 11:50:00 NaN
However, as far as I can tell, there is no way to use idxmax() or idxmin() when a groupby may contain empty groups. Doing so causes this exception:
File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 947, in wrapper
return self._python_apply_general(curried, self._obj_with_exclusions)
File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1287, in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py", line 820, in apply
res = f(group)
File "/.venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 936, in curried
return f(x, *args, **kwargs)
File "/.venv/lib/python3.9/site-packages/pandas/core/series.py", line 2355, in idxmax
i = self.argmax(axis, skipna, *args, **kwargs)
File "/.venv/lib/python3.9/site-packages/pandas/core/base.py", line 647, in argmax
return nanops.nanargmax( # type: ignore[return-value]
File "/.venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 93, in _f
return f(*args, **kwargs)
File "/.venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 1070, in nanargmax
result = values.argmax(axis) # type: ignore[var-annotated]
ValueError: attempt to get argmax of an empty sequence
I also can't find a way to select only the non-empty groups from grouped_df and then run idxmax() on just those groups. Ideally, I could do something like this:
group_counts = grouped_df['rel_temp'].count()
group_not_empty = (group_counts > 0)
grouped_df_no_empty_groups = grouped_df[group_not_empty]
grouped_df_no_empty_groups['rel_temp'].idxmax()
However, I can't seem to find a way to do this.
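One possible way to get the effect of the pseudocode above is to group by the floored timestamp instead of a pd.Grouper: a plain key array only produces groups that actually occur, and the result can then be reindexed onto the full range of windows. This is a sketch on the sample rows, not a benchmarked solution. The names bins, idx_max, full_range and idx_max_all are made up for illustration, and depending on the pandas version the groupby idxmax() may still run Python code once per group. Note that idxmax() returns timestamps (index labels), so empty windows show up as NaT.

```python
import pandas as pd

# Sample rows copied from the df.head() output.
df = pd.DataFrame(
    {"rel_temp": [0.02, -0.05, 0.01, 0.08, -0.02]},
    index=pd.DatetimeIndex(
        [
            "2016-04-04 10:41:07.663457",
            "2016-04-04 11:05:44.382078",
            "2016-04-04 11:42:08.777205",
            "2016-04-04 11:44:54.194832",
            "2016-04-04 11:49:48.349108",
        ],
        name="timestamp",
    ),
)

# Window label for every row; only windows that contain data appear here.
bins = df.index.floor("10min")

# Grouping by the label array never creates empty groups.
idx_max = df["rel_temp"].groupby(bins).idxmax()

# Put the result back on every window in range; missing windows become NaT.
full_range = pd.date_range(bins.min(), bins.max(), freq="10min", name="timestamp")
idx_max_all = idx_max.reindex(full_range)
print(idx_max_all)
```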
Edit: As @andrej-kesely pointed out, this can be done with .apply(); however, that approach is extremely slow. I need to be able to process these rows within 20 seconds, which should be possible with a direct (non-apply-based) method.
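To illustrate what I mean by a direct method, here is a rough sketch that avoids both groupby and .apply(): it sorts the rows by value with a stable NumPy argsort and keeps the first row seen in each window. The helper name first_extreme_timestamps and the column names idxmax, idxmin and max_first are hypothetical, and I have not timed this on the full data set, so treat it as an assumption that it is fast enough.

```python
import numpy as np
import pandas as pd


def first_extreme_timestamps(s: pd.Series, freq: str = "10min") -> pd.DataFrame:
    """Hypothetical helper: timestamp of the first max and first min per window."""
    bins = s.index.floor(freq)  # window label for every row
    vals = s.to_numpy()

    def pick(order: np.ndarray) -> pd.Series:
        # Reorder the rows, then keep the first row seen for each window.
        picked = pd.Series(s.index[order], index=bins[order])
        return picked[~picked.index.duplicated(keep="first")]

    # Stable sorts keep ties in original order, so the earliest extreme wins.
    t_max = pick(np.argsort(-vals, kind="stable"))  # largest value first
    t_min = pick(np.argsort(vals, kind="stable"))   # smallest value first

    # Reindex onto every window in range so empty windows appear as NaT.
    full = pd.date_range(bins.min(), bins.max(), freq=freq)
    out = pd.DataFrame({"idxmax": t_max, "idxmin": t_min}).reindex(full)

    # Comparisons against NaT are False, so check notna() for empty windows.
    out["max_first"] = out["idxmax"] < out["idxmin"]
    return out


# Sample rows copied from the df.head() output.
s = pd.Series(
    [0.02, -0.05, 0.01, 0.08, -0.02],
    index=pd.DatetimeIndex(
        [
            "2016-04-04 10:41:07.663457",
            "2016-04-04 11:05:44.382078",
            "2016-04-04 11:42:08.777205",
            "2016-04-04 11:44:54.194832",
            "2016-04-04 11:49:48.349108",
        ],
        name="timestamp",
    ),
    name="rel_temp",
)
out = first_extreme_timestamps(s)
print(out)
```

With several million rows, the cost here is two argsort passes plus a duplicate check, rather than one Python call per window.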
Any help is appreciated!