Removing repetitive/duplicate occurance in excel using python

Question

I am trying to remove the repetitive/duplicate Names which is coming under NAME column. I just want to keep the 1st occurrence from the repetitive/duplicate names by using python script.

This is my input excel:

And need output like this:

Please don't post image data, you need to provide text so others can test your data in their editors to see what works. Check out asking a good question in pandas https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Umar.H, Jul 06 '19 at 16:23
Why do you want to do that in pandas? is your sheet large? If it is not that large, you can easily do that with excel formulas (like placeing `IF(A2=A1;"":A2)` in a column behind your data, copying the formula down then copy the whole column and insert the values calculated by the fromula into column A. You could record this procedure as a macro, so you can reexecute it any time. — jottbe, Jul 06 '19 at 16:26
My data is very large. Its not this much small and i want to do with python script to merge with other codes or stuffs. @jottbe — sonali, Jul 06 '19 at 16:49
I see. The size alone wouldn't be an issue. I mean, you get your data in an excel sheet, so you could process it there, but if you do other stuff with the excel using python anyways, that's a point. — jottbe, Jul 06 '19 at 18:12

Umar.H · Accepted Answer · 2019-07-06T19:21:32.230

This isn't removing duplicates per say you're just filling duplicate keys in one column as blanks, I would handle this as follows :

by creating a mask where you return a true/false boolean if the row is == the row above.

assuming your dataframe is called df

mask = df['NAME'].ne(df['NAME'].shift())

df.loc[~mask,'NAME'] = ''

explanation :

what we are doing above is the following,

first selecting a single column, or in pandas terminology a series, we then apply a .ne (not equal to) which in effect is !=

lets see this in action.

import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']

now, lets create a dataframe similar to yours.

df = pd.DataFrame({'NAME' : names,
         'DEFAULT' : defaults,
         'CLASS' : classes,
         'AGE' : [np.random.randint(1,5) for len in names], 
         'GROUP' : [np.random.randint(1,5) for len in names]}) # being lazy with your age and group variables.

so, if we did df['NAME'].ne('Omar') which is the same as [df['NAME'] != 'Omar'] we would get.

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True

so, with that out of the way, we want to see if the name in row 1 (remember python is a 0 index language so row 1 is actually the 2nd physical row) is .eq to the row above.

we do this by calling [.shift][2] hyperlinked for more info.

what this basically does is shift the rows by its index with a defined variable number, lets call this n.

if we called df['NAME'].shift(1)

0       NaN
1     Rekha
2     Rekha
3      Jaya
4      Jaya
5    Sushma
6      Nita
7      Nita

we can see here that that Rekha has moved down

so putting that all together,

df['NAME'].ne(df['NAME'].shift())
0     True
1    False
2     True
3    False
4     True
5     True
6    False
7    False

we assign this to a self defined variable called mask you could call this whatever you want.

we then use [.loc][2] which lets you access your dataframe by labels or a boolean array, in this instance an array.

however, we only want to access the booleans which are False so we use a ~ which inverts the logic of our array.

    NAME    DEFAULT CLASS   AGE GROUP
1   Rekha       third   1   4
3   Jaya        fifth   1   1
6   Nita        fifth   1   2
7   Nita        fourth  1   4

all we need to do now is change these rows to blanks as your initial requirment, and we are left with.

    NAME    DEFAULT   CLASS AGE GROUP
0   Rekha             forth 2   2
1                     third 1   4
2   Jaya    c-default forth 3   3
3                     fifth 1   1
4   Sushma            fourth3   1
5   Nita    c-default third 4   2
6                     fifth 1   2
7                     fourth1   4

hope that helps!

awesome. its working. can you please explain this in brief.. means what is .ne, .loc, ~mask. — sonali, Jul 06 '19 at 16:41
@srish will do as soon as I'm home ! Thanks for your green tick . — Umar.H, Jul 06 '19 at 17:55
Its really very good explanation. Thank you so much for your time and this explanation, it helped me alot. If its in my hand i must have given you more than one green tick and upvotes :-) :-D — sonali, Jul 06 '19 at 20:15
you are most welcome, and don't worry, I learnt a lot by reading the docs to make the answer. Good luck, and make sure you provide a textual dataframe (like the one I gave above) if you ask another question. — Umar.H, Jul 06 '19 at 20:28
yes i will remember to use textual dataframe when i ask another question. Thanks:-) — sonali, Jul 06 '19 at 20:37

Removing repetitive/duplicate occurance in excel using python

1 Answers1

Linked

Related