
Scenario: assume a

  1. pd.DataFrame, loaded from an external source,
  2. where each row is one reading from a sensor and the index is a DatetimeIndex,
  3. with some rows having df.index.duplicated() == True, meaning there are rows with the same timestamp from different sensors.

When applying some logic, such as df.loc[df.A > 0, 'my_col'] = 1, I ran into ValueError: cannot reindex from a duplicate axis. This can be solved by simply removing the duplicated rows with

df[~df.index.duplicated()]
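
To illustrate what this drops, here is a minimal sketch on made-up data (the timestamps, column names A and B, and values are hypothetical, not from my real source). With the default keep="first", only the first row per timestamp survives and the other sensor's reading is discarded:

```python
import pandas as pd

# Hypothetical sample: two sensors report at the same timestamp.
idx = pd.to_datetime(["2020-06-01 00:00", "2020-06-01 00:00", "2020-06-01 00:01"])
df = pd.DataFrame({"A": [1.0, 3.0, 5.0], "B": [10.0, 20.0, 30.0]}, index=idx)

# duplicated() flags every repeat after the first occurrence (keep="first" is the default),
# so negating the mask keeps only the first row for each timestamp.
deduped = df[~df.index.duplicated()]
print(deduped)
```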

But I wonder whether it is possible to apply a column-based function during the index de-duplication, e.g. calculating the mean/max/min of column A/B/C for the duplicated rows.

Is this possible? It's something like a groupby.aggregate on the df.index.duplicated() rows.

gies0r
  • have you tried something like `df.groupby(df.index).mean()`? – Ben.T Jun 02 '20 at 22:28
  • or `df.groupby(level=0).mean()`? – Quang Hoang Jun 02 '20 at 22:40
  • Thank you both for your reply. That would in fact apply the `mean()` function on **all** columns, not only specific ones. E.g: If I would like to hold the `max` value on column `A` and the `mean` value for column `B`, that would not work. – gies0r Jun 02 '20 at 22:42

1 Answer


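Check with describe, which computes count, mean, std, min, quartiles, and max for every numeric column within each group of identical index values: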

df.groupby(level=0).describe()
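
If only one specific statistic per column is wanted (e.g. max for A and mean for B, as mentioned in the comments), one option is to pass a dict to agg on the same groupby. This is a minimal sketch on hypothetical sample data; the column names A and B and the values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sample: two rows share the first timestamp.
idx = pd.to_datetime(["2020-06-01 00:00", "2020-06-01 00:00", "2020-06-01 00:01"])
df = pd.DataFrame({"A": [1.0, 3.0, 5.0], "B": [10.0, 20.0, 30.0]}, index=idx)

# Group rows by their index value and choose one aggregation per column.
result = df.groupby(level=0).agg({"A": "max", "B": "mean"})
print(result)
```

The result has one row per unique timestamp, so the index no longer contains duplicates. Columns not listed in the dict are left out of the output.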
BENY