
I have a pandas DataFrame like the one below: `user` is a column which can contain duplicates; C1, C2, and C3 are also columns.

I want to delete only those rows which have a duplicated `user` value and NaN for all values in the C1, C2, C3 columns.

Expected output for this example: delete the 1st row (user 1) as it is all NaN, but don't delete row 3 (user 2) since it has only one instance (no duplicates). How can I accomplish this across all such rows?

user      C1         C2         C3

1       NaN        NaN        NaN
1       NaN        x          y
2       NaN        NaN        NaN
3        a          b         c
  • To clarify, you're looking to drop duplicate id values whose rows are all NaN? And `id` is a column or the index? – Henry Ecker Jul 25 '21 at 01:17
  • @HenryEcker Yes, id is a column as well, and I want to delete the rows where columns C1, C2, C3 are all NaN, but only when the user has other instances (rows) with some non-NaN values in C1, C2, C3. – Steve Smith Jul 25 '21 at 01:29
  • For the simpler case without duplicate `user`, see [this answer](https://stackoverflow.com/questions/25146277/pandas-delete-rows-with-only-nan-values): `df.dropna(thresh=3)` or `df.dropna(how='all')` – smci Jul 25 '21 at 01:50
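As a quick sketch of that simpler case (ignoring the duplicate-`user` condition entirely), `dropna(how='all')` with a `subset` drops every row whose C1, C2, C3 are all NaN. Note it would also drop user 2's only row, which is exactly why the question needs the duplicate check:

```python
import numpy as np
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'C1': [np.nan, np.nan, np.nan, 'a'],
    'C2': [np.nan, 'x', np.nan, 'b'],
    'C3': [np.nan, 'y', np.nan, 'c'],
})

# Drop rows where C1..C3 are ALL NaN, with no regard for duplicate users
out = df.dropna(how='all', subset=['C1', 'C2', 'C3'])
print(out)
```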

4 Answers


You can do it like this:

import numpy as np
import pandas as pd

# Get the count of each user
res = dict(df['user'].value_counts())

def check(idx):
    '''
    If the row at the given index has NULL in all three C columns and the
    occurrence of that user is greater than 1, we return False (we don't
    want this row); otherwise, we return True (we want this row).
    '''
    if df.loc[idx, 'temp'] == 3 and res[df.loc[idx, 'user']] > 1:
        return False
    else:
        return True

# temp column: number of NaN values per row (3 means C1, C2, C3 are all NaN)
df['temp'] = np.sum(df.isna(), axis=1)
df['temp and dup'] = df.index.map(check)
# Now we just select the rows we want.
df = df[df['temp and dup']]
df.drop(columns=['temp', 'temp and dup'], inplace=True)
df

If it solved your problem, then give it the green tick.

Abhishek Prajapat
  • I think id might be creating confusion, so I changed it to user; and user is not an index column. Also, I want to perform this operation on all such rows. – Steve Smith Jul 25 '21 at 01:37
  • @SteveSmith I am using the index just to access the data frame at each row. Check carefully: I am not using an index instead of id/user. Just try to see the result of `df` after the `df['temp and dup'] = df.index.map(check)` line. – Abhishek Prajapat Jul 25 '21 at 01:38

We can create a mask to keep rows: exclude any row where the id is duplicated (`duplicated` with `keep=False`) and cols C1, C2, and C3 are all NaN (`isna`):

df = df[~(df['user'].duplicated(keep=False) &
          df[['C1', 'C2', 'C3']].isna().all(axis=1))]

df:

   user   C1   C2   C3
1     1  NaN    x    y
2     2  NaN  NaN  NaN
3     3    a    b    c

If there are lots of columns, `loc` can be used to select a range instead of listing them all.

  1. All columns after C1 (inclusive):
df = df[~(df['user'].duplicated(keep=False) &
          df.loc[:, 'C1':].isna().all(axis=1))]
  2. All columns between C1 and C3 (inclusive):
df = df[~(df['user'].duplicated(keep=False) &
          df.loc[:, 'C1':'C3'].isna().all(axis=1))]

Breakdown of the mask creation as a DataFrame:

breakdown_df = df.join(
    pd.DataFrame({
        'duplicated_id': df['user'].duplicated(keep=False),
        'all_nan': df[['C1', 'C2', 'C3']].isna().all(axis=1),
        'neither_nor': ~(df['user'].duplicated(keep=False) &
                         df[['C1', 'C2', 'C3']].isna().all(axis=1))
    })
)

breakdown_df:

   user   C1   C2   C3  duplicated_id  all_nan  neither_nor
0     1  NaN  NaN  NaN           True     True        False
1     1  NaN    x    y           True    False         True
2     2  NaN  NaN  NaN          False     True         True
3     3    a    b    c          False    False         True
  • The True neither_nor rows are the rows that are kept
Henry Ecker

Something like `dropna` then `reindex`:

out = df.set_index('user').dropna(how = 'all',axis = 0).reindex(df.user.unique()).reset_index()
Out[437]: 
   user   C1   C2   C3
0     1  NaN    x    y
1     2  NaN  NaN  NaN
2     3    a    b    c
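A sketch of what each step in that chain produces, using the example frame from the question (note this relies on `reindex`, which requires the surviving `user` labels to be unique, i.e. at most one non-NaN row per user):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'C1': [np.nan, np.nan, np.nan, 'a'],
    'C2': [np.nan, 'x', np.nan, 'b'],
    'C3': [np.nan, 'y', np.nan, 'c'],
})

step1 = df.set_index('user')             # make user the index
step2 = step1.dropna(how='all', axis=0)  # drop rows that are entirely NaN
# reindex re-inserts users whose every row was dropped (user 2) as one NaN row
out = step2.reindex(df.user.unique()).reset_index()
print(out)
```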
BENY
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'C1': [np.nan, np.nan, np.nan, 'a'],
    'C2': [np.nan, 'x', np.nan, 'b'],
    'C3': [np.nan, 'y', np.nan, 'c']
})

nanrows = df.iloc[:, 1:].isna().all(axis=1)
counts = df.user.value_counts()
dupes = df.user.map(counts) > 1

df[~(dupes & nanrows)]
   user   C1   C2   C3
1     1  NaN    x    y
2     2  NaN  NaN  NaN
3     3    a    b    c

EDIT:

The above solution has a subtle bug: if a user only exists in all-NaN rows, that user gets dropped entirely. You can see it with the following modified dataframe:

df = pd.DataFrame(
    {
        "user": [1, 1, 2, 3, 4, 4],
        "C1": [np.nan, np.nan, np.nan, "a", np.nan, np.nan],
        "C2": [np.nan, "x", np.nan, "b", np.nan, np.nan],
        "C3": [np.nan, "y", np.nan, "c", np.nan, np.nan],
    }
)

You can resolve this by finding the first occurrence of such a user and marking it as "not duplicate".

todrop = df.user[(dupes & nanrows)].drop_duplicates()
tokeep = df.user[~(dupes & nanrows)].drop_duplicates()

if not todrop.isin(tokeep).all():
    notinkeep = todrop[~todrop.isin(tokeep)]
    duped_user = df.user[dupes]
    for user in notinkeep:
        for i, val in duped_user.items():
            if val == user:
                # mark the first all-NaN row of this user as "not duplicate"
                dupes.loc[i] = False
                break

The nested for-loop is a bit nasty, but I think it could be simplified with some work.
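One way to avoid the loops entirely (a sketch, not the answer's code): compute, via `groupby(...).transform('all')`, which users have only all-NaN rows, and rescue the first row of each such user:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 2, 3, 4, 4],
    "C1": [np.nan, np.nan, np.nan, "a", np.nan, np.nan],
    "C2": [np.nan, "x", np.nan, "b", np.nan, np.nan],
    "C3": [np.nan, "y", np.nan, "c", np.nan, np.nan],
})

all_nan = df.iloc[:, 1:].isna().all(axis=1)
dup = df["user"].duplicated(keep=False)
# True for every row of a user whose rows are ALL all-NaN (users 2 and 4)
user_all_nan = all_nan.groupby(df["user"]).transform("all")
# drop all-NaN rows of duplicated users, but rescue the first row of
# users that have no non-NaN rows at all
keep = ~(dup & all_nan) | (user_all_nan & ~df["user"].duplicated())
out = df[keep]
print(out)
```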

suvayu