I'm working with a df:

df.shape[0]

82208

And I want to index duplicates based on firstname, lastname and email:

indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
       .duplicated(subset=['firstname', 'lastname', 'email'], keep=False))


indx

0         True
1         True
2        False
3        False
4         True
5         True

indx.shape[0]

73797

I am unable to use this against the original df with df[indx] because they do not match in size, as you can see from .shape[0]. I also tried indx.index, but I get:


df[indx.index]

KeyError: "None of [Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,\n                9,\n            ...\n            82198, 82199, 82200, 82201, 82202, 82203, 82204, 82205, 82206,\n            82207],\n           dtype='int64', length=73797)] are in the [columns]"

I know it's something very simple, I just can't figure it out. It seems the indx I generate is missing the rows removed by dropna(), so its index no longer matches the original df. What I'm trying to get is a boolean index of where there are dupes in the first df. I'm guessing my problem has something to do with the dropna() when generating the index.
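
As a minimal sketch of what's going on (using a hypothetical 3-row frame, small): dropna() keeps the original row labels but removes rows, so the mask built from it is shorter than the df:

import pandas as pd
import numpy as np

# hypothetical 3-row frame; row 1 has a NaN and gets dropped
small = pd.DataFrame({'email': ['a@com', np.nan, 'a@com']})

# the mask keeps labels 0 and 2 only -- label 1 is gone, length 2 vs 3
mask = small.dropna().duplicated(keep=False)
print(mask)

0    True
2    True
dtype: bool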

edit: It was suggested to check out a duplicate post, but that doesn't answer my question. The duplicate is just basic indexing.

My problem is that in generating the new boolean series 'indx', the index labels of the rows dropped from the original df are lost, so it can't be used to index the df.

edit: another solution for this is to reindex the mask so it matches the index (and size) of the df.


import pandas as pd
import numpy as np

df = pd.DataFrame({'firstname':['stack','Bar Bar',np.nan,'Bar Bar','john','mary','jim'],
                   'lastname':['jim','Bar','Foo Bar','Bar','con','sullivan','Ryan'],
                   'email':[np.nan,'Bar','Foo Bar','Bar','john@com','mary@com','Jim@com']})

print(df)

  firstname  lastname     email
0     stack       jim       NaN
1   Bar Bar       Bar       Bar
2       NaN   Foo Bar   Foo Bar
3   Bar Bar       Bar       Bar
4      john       con  john@com
5      mary  sullivan  mary@com
6       jim      Ryan   Jim@com


indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
                 .duplicated(subset=['firstname', 'lastname', 'email'], keep=False))

indx = indx.reindex(df.index, fill_value=False)


df[indx]

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar
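
With the mask reindexed to the full df, the df.loc[indx] form suggested in the comments returns the same two rows:

df.loc[indx]

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar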
SCool
  • do you want `df.loc[indx]`? – anky Jul 08 '19 at 14:30
  • Yes, that would be good. I tried that but got errors. – SCool Jul 08 '19 at 14:36
  • one more thing: if you don't want NaN values, why not return them as `False` instead of dropping them: `indx = (~df[['firstname', 'lastname', 'email']].isna().any(1) & df.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))` – anky Jul 08 '19 at 14:36
  • Thank you, that seems to work, just doing a quick check on it in a csv. Can you figure out any reason why my version didn't work? – SCool Jul 08 '19 at 15:07
  • So in that case it's not a dupe. If you can confirm, I will post an answer (the problem with your version is `.dropna()` – you lose out on indexes, which is not a good way to do boolean indexing) – anky Jul 08 '19 at 15:08
  • Hi @anky_91, your version works for me. Checked the CSV and they are all duplicates, which is the expected result. – SCool Jul 08 '19 at 16:14

1 Answer


Instead of dropping the NaNs and then creating the boolean mask, build the mask so that it returns False for rows containing NaN. That way all indexes are retained, just with False for the NaN rows. Using df.isna() and .any() along axis=1:

cols = ['firstname', 'lastname', 'email']

# True only for rows with no NaN in cols that are also duplicated on cols
index = ~df[cols].isna().any(axis=1) & df.duplicated(subset=cols, keep=False)
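
Applied to the sample frame from the question, this keeps the full index and picks out only the real duplicates:

print(df[index])

  firstname lastname email
1   Bar Bar      Bar   Bar
3   Bar Bar      Bar   Bar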
anky