I'm working with a df:
df.shape[0]
82208
And I want to index duplicates based on firstname, lastname and email:
indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
indx
0 True
1 True
2 False
3 False
4 True
5 True
indx.shape[0]
73797
I am unable to use this against the original df with df[indx],
as the two don't match in size, as you can see from .shape[0]
above. I also tried indexing with indx.index,
but I get:
df[indx.index]
KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8,\n 9,\n ...\n 82198, 82199, 82200, 82201, 82202, 82203, 82204, 82205, 82206,\n 82207],\n dtype='int64', length=73797)] are in the [columns]"
I know it's something very simple; I just can't figure it out. It seems the indx
I generate loses its alignment with the original index. What I'm trying to get is a boolean index of where the dupes are in the original df. I'm guessing my problem has something to do with the dropna()
call when generating the index.
edit: It was suggested to check out a duplicate post, but it doesn't answer my question. The suggested duplicate only covers basic indexing.
My problem is that in generating the new boolean series indx
, the original df
row labels are lost, so it can't be used to index the df
.
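For what it's worth, dropna() actually preserves the original row labels, so another way to pick out the duplicate rows (a sketch, using a cut-down version of the example df below) is to take the index of the True entries and look those labels up in the original frame:

```python
import numpy as np
import pandas as pd

cols = ['firstname', 'lastname', 'email']
df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar']})

# dropna() keeps the original labels, so indx is shorter than df
# but still carries df's index (here: 1 and 3 survive).
indx = (df.dropna(subset=cols)
          .duplicated(subset=cols, keep=False))

# indx[indx] keeps only the True entries; .index gives their labels,
# which can be looked up directly in the original df with .loc.
dupes = df.loc[indx[indx].index]
```

This sidesteps the size mismatch entirely, because .loc works on labels rather than on a positional boolean mask.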
edit: another solution is to reindex the boolean series so it matches the size of the df.
df = pd.DataFrame({'firstname':['stack','Bar Bar',np.nan,'Bar Bar','john','mary','jim'],
'lastname':['jim','Bar','Foo Bar','Bar','con','sullivan','Ryan'],
'email':[np.nan,'Bar','Foo Bar','Bar','john@com','mary@com','Jim@com']})
print(df)
firstname lastname email
0 stack jim NaN
1 Bar Bar Bar Bar
2 NaN Foo Bar Foo Bar
3 Bar Bar Bar Bar
4 john con john@com
5 mary sullivan mary@com
6 jim Ryan Jim@com
indx = (df.dropna(subset=['firstname', 'lastname', 'email'])
.duplicated(subset=['firstname', 'lastname', 'email'], keep=False))
indx = indx.reindex(df.index, fill_value=False)
df[indx]
firstname lastname email
1 Bar Bar Bar Bar
3 Bar Bar Bar Bar
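An equivalent route that skips the intermediate reindex (a sketch; note that duplicated() treats NaN keys as equal to each other, which the notna() mask filters out) is to build both masks on the full df so their indexes already line up:

```python
import numpy as np
import pandas as pd

cols = ['firstname', 'lastname', 'email']
df = pd.DataFrame({'firstname': ['stack', 'Bar Bar', np.nan, 'Bar Bar', 'john', 'mary', 'jim'],
                   'lastname': ['jim', 'Bar', 'Foo Bar', 'Bar', 'con', 'sullivan', 'Ryan'],
                   'email': [np.nan, 'Bar', 'Foo Bar', 'Bar', 'john@com', 'mary@com', 'Jim@com']})

# Both masks have df's full index, so no reindexing is needed:
# rows with any NaN in the key columns are excluded, and the rest
# are flagged if they appear more than once.
mask = df[cols].notna().all(axis=1) & df.duplicated(subset=cols, keep=False)
print(df[mask])
```

This gives the same two 'Bar Bar' rows as the reindex approach, with one combined boolean mask.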