I'm working with a df:

df.shape[0]

82208

And I want to index duplicates based on firstname, lastname and email:

indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
       .duplicated(subset=['firstname', 'lastname', 'email'], keep=False))


indx

0         True
1         True
2        False
3        False
4         True
5         True

indx.shape[0]

73797

I am unable to use this against the original df with df[indx] because they do not match in size, as you can see from .shape[0]. I also tried indx.index, but I get:


df[indx.index]

KeyError: "None of [Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,\n                9,\n            ...\n            82198, 82199, 82200, 82201, 82202, 82203, 82204, 82205, 82206,\n            82207],\n           dtype='int64', length=73797)] are in the [columns]"

I know it's something very simple, I just can't figure it out. It seems the indx I generate is missing the rows removed by dropna(), so its index no longer matches the original df. What I'm trying to get is a boolean index of where there are dupes in the first df. I'm guessing my problem has something to do with the dropna() when generating the index.
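
As a minimal sketch of what's going on (using a hypothetical 3-row frame, small): dropna() keeps the original row labels but removes rows, so the mask built from it is shorter than the df:

import pandas as pd
import numpy as np

# hypothetical 3-row frame; row 1 has a NaN and gets dropped
small = pd.DataFrame({'email': ['a@com', np.nan, 'a@com']})

# the mask keeps labels 0 and 2 only -- label 1 is gone, length 2 vs 3
mask = small.dropna().duplicated(keep=False)
print(mask)

0    True
2    True
dtype: bool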

edit: It was suggested to check out a duplicate post, but that doesn't answer my question. The duplicate is just basic indexing.

My problem is that in generating the new boolean series 'indx', the index labels of the rows dropped from the original df are lost, so it can't be used to index the df.

edit: another solution for this is to reindex the mask so it matches the index (and size) of the df.


import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname':['stack','Bar Bar',np.nan,'Bar Bar','john','mary','jim'],
                   'lastname':['jim','Bar','Foo Bar','Bar','con','sullivan','Ryan'],
                   'email':[np.nan,'Bar','Foo Bar','Bar','john@com','mary@com','Jim@com']})

print(df)

  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com


indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
                 .duplicated(subset=['firstname', 'lastname', 'email'], keep=False))

indx = indx.reindex(df.index, fill_value=False)


df[indx]

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar
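
With the mask reindexed to the full df, the df.loc[indx] form suggested in the comments returns the same two rows:

df.loc[indx]

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar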
SCool
  • do you want `df.loc[indx]`? – anky Jul 08 '19 at 14:30
  • Yes, that would be good. I tried that but got errors. – SCool Jul 08 '19 at 14:36
  • one more thing: if you don't want NaN values, why not return them as `False` instead of dropping them: `indx = (~df[['firstname', 'lastname', 'email']].isna().any(1) & df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))` – anky Jul 08 '19 at 14:36
  • Thank you, that seems to work, just doing a quick check on it in a csv. Can you figure out any reason why my version didn't work? – SCool Jul 08 '19 at 15:07
  • So in that case it's not a dupe. If you can confirm, I will post an answer (the problem with your version is `.dropna()` – you lose out on indexes, which is not a good way to do boolean indexing) – anky Jul 08 '19 at 15:08
  • Hi @anky_91, your version works for me. Checked the CSV and they are all duplicates, which is the expected result. – SCool Jul 08 '19 at 16:14

1 Answer


Instead of dropping the NaNs and then creating the boolean mask, build the mask so that it returns False for rows containing NaN. That way all indexes are retained, just with False for the NaN rows. Using df.isna() and .any() along axis=1:

cols = ['firstname', 'lastname', 'email']

# True only for rows with no NaN in cols that are also duplicated on cols
index = ~df[cols].isna().any(axis=1) & df.duplicated(subset=cols, keep=False)
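
Applied to the sample frame from the question, this keeps the full index and picks out only the real duplicates:

print(df[index])

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar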
anky