7

How do I replace duplicates for each group with NaNs while keeping the rows?

I need to keep all of the rows rather than dropping the duplicates, replacing each repeated value with NaN while the first occurrence keeps its original value.

import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00','2019-01-01 01:00:00','2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00','2019-09-01 03:00:00','2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10,10,10,10,12,12,12,12],
    'ID': ['Jackie','Jackie','Jackie','Jackie','Zoop','Zoop','Zoop','Zoop',]
})

df['date'] = pd.to_datetime(df['date'])


                 date  value      ID
0 2019-01-01 00:00:00     10  Jackie
1 2019-01-01 01:00:00     10  Jackie
2 2019-01-01 02:00:00     10  Jackie
3 2019-01-01 03:00:00     10  Jackie
4 2019-09-01 02:00:00     12    Zoop
5 2019-09-01 03:00:00     12    Zoop
6 2019-09-01 04:00:00     12    Zoop
7 2019-09-01 05:00:00     12    Zoop

Desired Dataframe:

                 date  value      ID
0 2019-01-01 00:00:00     10  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00     12    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop

Edit:

Duplicated values should only be dropped within the same calendar date, regardless of frequency. So if value 10 shows up twice on Jan-1 and three times on Jan-2, value 10 should still show up once on Jan-1 and once on Jan-2, as illustrated below.
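
A minimal sketch of that requirement (these extra rows are hypothetical, not part of the frame above):

import pandas as pd

ex = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 00:00', '2019-01-01 01:00',
                            '2019-01-02 00:00', '2019-01-02 01:00',
                            '2019-01-02 02:00']),
    'value': [10, 10, 10, 10, 10],
    'ID': ['Jackie'] * 5,
})
# Expected result: value 10 survives once per calendar day
#                  date  value      ID
# 0 2019-01-01 00:00:00   10.0  Jackie
# 1 2019-01-01 01:00:00    NaN  Jackie
# 2 2019-01-02 00:00:00   10.0  Jackie
# 3 2019-01-02 01:00:00    NaN  Jackie
# 4 2019-01-02 02:00:00    NaN  Jackie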

Starbucks

3 Answers

10

I assume you want to check duplicates on columns `value` and `ID`, further restricted to the calendar date of column `date`:

import numpy as np

# Flag every repeat of the same (value, ID, calendar day) after its first occurrence
df.loc[df.assign(d=df.date.dt.date).duplicated(['value', 'ID', 'd']), 'value'] = np.nan

Out[269]:
                 date  value      ID
0 2019-01-01 00:00:00   10.0  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00   12.0    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop
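
The trick is that `DataFrame.duplicated` marks every repeat after the first occurrence, so only the first (value, ID, day) combination keeps its value. Inspecting the mask on the original, unmodified frame (a quick sketch):

df.assign(d=df.date.dt.date).duplicated(['value', 'ID', 'd'])
# 0    False   <- first (10, Jackie, Jan-1): kept
# 1     True
# 2     True
# 3     True
# 4    False   <- first (12, Zoop, Sep-1): kept
# 5     True
# 6     True
# 7     True
# dtype: bool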

As @Trenton suggests, you may use `pd.NA` to avoid importing numpy:

(Note: as @rafaelc suggests, here is a link explaining the differences between `pd.NA` and `np.nan` in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)

df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA

Out[273]:
                 date value      ID
0 2019-01-01 00:00:00    10  Jackie
1 2019-01-01 01:00:00  <NA>  Jackie
2 2019-01-01 02:00:00  <NA>  Jackie
3 2019-01-01 03:00:00  <NA>  Jackie
4 2019-09-01 02:00:00    12    Zoop
5 2019-09-01 03:00:00  <NA>    Zoop
6 2019-09-01 04:00:00  <NA>    Zoop
7 2019-09-01 05:00:00  <NA>    Zoop
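
One caveat (an editorial note, not part of the answer): assigning `pd.NA` into a plain `int64` column upcasts it to `object`, as the comments below point out. Pandas' nullable integer dtype avoids this; a minimal sketch, assuming pandas >= 1.0 and starting from the original frame:

df['value'] = df['value'].astype('Int64')        # nullable integer dtype
df.loc[df.assign(d=df.date.dt.date).duplicated(['value', 'ID', 'd']), 'value'] = pd.NA
df['value'].dtype                                # Int64, not object
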
Andy L.
  • You can use `pd.NA` so importing numpy isn't required just for `np.nan`. – Trenton McKinney Oct 08 '20 at 19:54
  • Duplicated values should only be dropped on the same date, regardless of frequency. So if value 10 shows up twice on Jan-1 and three times on Jan-2, value 10 should only show up once on Jan-1 and once on Jan-2. – Starbucks Oct 08 '20 at 19:56
  • Probably worth adding that `pd.NA` and `np.nan` can have very different behavior. They're not necessarily interchangeable. – rafaelc Oct 08 '20 at 19:59
  • @rafaelc: added the link explaining the differences between `pd.NA` and `np.nan`. – Andy L. Oct 08 '20 at 20:07
  • I was just playing with `pd.NA`, and realized that when creating a column of `int`s or `float`s, `pd.NA` causes the column to become `object` type. Likewise, `pd.NA` forced the type of `'value'` to become `object`. Therefore, I withdraw my suggestion. – Trenton McKinney Oct 08 '20 at 22:42
1

This works if the dataframe is sorted, as in your example:

import numpy as np                                    # to be used for np.nan

df['duplicate'] = df['value'].shift(1)                # create a duplicate column 
df['value'] = df.apply(lambda x: np.nan if x['value'] == x['duplicate'] \
                          else x['value'], axis=1)    # conditional replace
df = df.drop('duplicate', axis=1)                     # drop helper column
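
Note that the plain shift compares across ID and date boundaries, so a value carried over from the previous group would also be blanked. A group-aware variant (a sketch, not part of the answer, starting from the original frame) limits the comparison to the same calendar day and ID:

import numpy as np

# Compare each value only to the previous row within its (day, ID) group;
# consecutive repeats become NaN, the first occurrence is kept.
prev = df.groupby([df['date'].dt.date, 'ID'])['value'].shift(1)
df['value'] = df['value'].where(df['value'] != prev)
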
Oliver Prislan
1

Group on the date (and ID) and take the first observed value (not necessarily the first when sorted by time), then merge the result back to the original dataframe. Since `.first()` also returns the first timestamp of each group, the merge on the full datetime matches only that first row; every other row gets NaN.

df2 = df.groupby([df['date'].dt.date, 'ID'], as_index=False).first()
df.drop(columns='value').merge(df2, on=['date', 'ID'], how='left')[df.columns]
                 date  value      ID
0 2019-01-01 00:00:00   10.0  Jackie
1 2019-01-01 01:00:00    NaN  Jackie
2 2019-01-01 02:00:00    NaN  Jackie
3 2019-01-01 03:00:00    NaN  Jackie
4 2019-09-01 02:00:00   12.0    Zoop
5 2019-09-01 03:00:00    NaN    Zoop
6 2019-09-01 04:00:00    NaN    Zoop
7 2019-09-01 05:00:00    NaN    Zoop
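
The same result can be had without a merge (an alternative sketch, not from the answer, starting from the original frame): build the duplicate mask directly and blank everything except the first row of each (day, ID) group:

dup = df.assign(d=df['date'].dt.date).duplicated(['d', 'ID'])
df['value'] = df['value'].where(~dup)            # NaN everywhere but the first row
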
Alexander