Get rid of duplicate rows on conditions for a pandas dataframe

Question

I have a dataframe with lots of duplicated rows on index, like this:

olddf = pd.DataFrame(index=['MJ','MJ','MJ','BJ','KJ','KJ'],data={'name':['masdjsdf','machael jordon','mskkkadke','boris johnson', 'kim jongun', 'kkasdfl'],'age':[23,40,31,35,25,30]})

I need to get rid of the duplicate index(rows) which also don't match the dictionary dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}. So after the operation, the dataframe should become

newdf = pd.DataFrame(index=['MJ','BJ','KJ'],data={'name':['machael jordon','boris johnson', 'kim jongun',],'age':[40,35,25]})

Thank you...

mozway · Accepted Answer · 2022-01-02T06:28:51.857

3

Use map to set the values from the dictionary keys, then eq to compare with the column's data. If equal this yields True, you can use the resulting Series of booleans as a mask to slice the original dataframe:

mask = olddf['name'].eq(olddf.index.map(dic))

newdf = olddf[mask]

Output:

              name  age
MJ  machael jordon   40
BJ   boris johnson   35
KJ      kim jongun   25

also keeping the non duplicated rows

Simple, add a second mask:

mask2 = ~olddf.index.duplicated(keep=False)

newdf = olddf[mask|mask2]

edited Jan 02 '22 at 06:28

answered Jan 02 '22 at 05:41

mozway

194,879
13
39
75

1

Hi, thank you for the solution! Works great to my question. I however find that I might need some change to my original problem. Because I do not have all the index entry of olddf available in the dictionary. The dictionary is there only to help solve duplications. So if I use your method, I find that entries which don't exist in dictionary ( and don't have duplicates) were dropped. As an example, say if in my original problem, the dictionary doesn't have "BJ":"boris johnson" in it, I still need it to stay in the dataframe after the operation as it doesn't have a duplicate either. – Alex Jan 02 '22 at 06:10
@Alex see update ;) – mozway Jan 02 '22 at 06:29
How is the holiday going @mozway? :-) Map is cumbersome and can be quite slow. not handy in this situation I guess – wwnde Jan 02 '22 at 07:03
@wwnde doing fine thanks ;) I wish you a happy new year. The issue with `isin` is that it will only check the values, ignoring the key (imagine a case where KJ would have a Boris Johnson value) – mozway Jan 02 '22 at 07:18
Thanks again. I can only say it works great at the moment as I have not used 'map' before. Time to digest the solution. Nice day! – Alex Jan 02 '22 at 08:17

wwnde · Answer 2 · 2022-01-02T07:42:44.553

0

Take it easy, make it easy. Make the dic a pd.Series and check inclusion using .isin(). Code below

lst={'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}

olddf[olddf.isin(pd.Series(lst, name='name')).any(1)]

Alternatively create a dataframe dict. append the name column to index on both new and old df and then merge on index. left or inner merge would do. Code below

pd.DataFrame(pd.Series(lst, name='name')).set_index('name', append=True).merge(olddf.set_index('name', append=True), how='left',left_index=True, right_index=True )

Outcome

            name  age
MJ  machael jordon   40
BJ   boris johnson   35
KJ      kim jongun   25

edited Jan 02 '22 at 07:42

answered Jan 02 '22 at 07:00

wwnde

26,119
6
18
32

I think you meant `olddf['name'].isin(pd.Series(dic, name='name'))` not `any` that would use the age column ;) – mozway Jan 02 '22 at 07:19
Yes, but we are associating by index and name to filter. Moreless an inner merge on index and name as I see it @mozway – wwnde Jan 02 '22 at 07:21
1

I see, then use `olddf[['name']].isin(pd.Series(dic, name='name'))['name']` this prevents uselessly (and perhaps wrongly, in case there is a collision) checking the other columns – mozway Jan 02 '22 at 07:39
Okay, we can then try `pd.DataFrame(pd.Series(lst, name='name')).set_index('name', append=True).merge(olddf.set_index('name', append=True), how='left',left_index=True, right_index=True )` I would prefer this to map. – wwnde Jan 02 '22 at 07:40
1

No sure why we got DV, but this made me realize I forgot to UV your answer ;) – mozway Jan 02 '22 at 08:27
Not sure too gave you my +1 – wwnde Jan 02 '22 at 08:49
Hey just wanna say thank you for the answer. I tried your first solution but it could not keep the index entry that's not in the dictionary. The second solution did not work in my working problem. I'll go with mozway's solution. :) – Alex Jan 02 '22 at 13:58
I do not see why it should not work. There is no technical explanation as to why. Anyway you seem to have go a solution. Happy coding. Keep coding – wwnde Jan 03 '22 at 01:12

score 0 · Answer 3 · answered Dec 06 '22 at 05:58

dic = {'MJ':'machael jordon', 'BJ':'boris johnson', 'KJ':'kim jongun'}
olddf.join(pd.Series(dic).to_frame('name'),rsuffix='_2').query("name==name_2")

         name  age          name_2
BJ   boris johnson   35   boris johnson
KJ      kim jongun   25      kim jongun
MJ  machael jordon   40  machael jordon

Get rid of duplicate rows on conditions for a pandas dataframe

3 Answers3

also keeping the non duplicated rows