Drop duplicated rows based on multiple columns if other column(s) is NaNs in Pandas

Question

Given a test dataset as follows:

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   5   sh       hp      12.0    NaN
5   6   sh       hp       NaN    NaN
6   7   sh       pd       NaN    NaN

I would like to drop duplicated rows based on city and district, then drop rows if its quantity is NaN, but if city and district are not duplicated, then even if quantity is NaN, it's not necessary to drop rows.

Code based on link from here:

m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()

df1 = df[m1 & m2]
print(df1)

Out:

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN

But I want to keep the last row since it's not duplicated with any rows. How could I do that?

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN

jezrael · Answer 1 · 2021-01-18T09:23:44.717

1

You can use:

m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()
m3 = ~df[['city', 'district']].duplicated(keep=False)

df1 = df[(m1 & m2) | (~m1 & m3)]
print(df1)
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

edited Jan 18 '21 at 09:23

answered Jan 18 '21 at 09:17

jezrael

822,522
95
1,334
1,252

Thanks but how does that ensure the rows droped are the ones whose `quantity` are `NaN`s? – ah bon Jan 18 '21 at 09:22
1

@ahbon - Added another condition for test all non duplicated values, with `keep=False` for get all non duplicated NaNs – jezrael Jan 18 '21 at 09:24

Sayandip Dutta · Accepted Answer · 2021-01-18T09:49:56.930

According to your logic, it seems like you want to drop columns where:

quantity is NaN => m1 = df['quantity'].isna() AND
'city', 'district' is duplicated => m2 = df[['city', 'district']].duplicated(keep=False)

And since you would like to take all columns except which meet the above conditions:

>>> m1 = df['quantity'].isna()
>>> m2 = df[['city', 'district']].duplicated(keep=False)
>>> df[~(m1 & m2)]

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

And your original code would work with keep=False and | (or) operator.

>>> m1 = df['quantity'].notna()
>>> m2 = ~df[['city', 'district']].duplicated(keep=False)
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

EDIT

Based on your comments, if the df is:

>>> df
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# First drop duplicates with NaN items, with any of the above methods
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# then drop duplicates with default condition:
>>> df[m1 | m2].drop_duplicates(['city', 'district'])
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
6   7   sh       pd       NaN    NaN

You can change keep parameter of drop_duplicates to control default behavior, i.e. whether to keep the first duplicate or the last.

I'm trying to do is to find duplicated rows based on the `city` and `district` columns. If the `quantity` is `NaN`, the rows with `NaN` `quantity` will be deleted first (it is also possible to judge whether multiple columns are `NaN`s); if it is not `NaN`, drop duplicated rows according to the default conditions. — ah bon, Jan 18 '21 at 09:33
I see, then all you need to do is, add another `..drop_duplicates(['city', 'district'])` after any of the above solutions. — Sayandip Dutta, Jan 18 '21 at 09:44

Drop duplicated rows based on multiple columns if other column(s) is NaNs in Pandas

2 Answers2