2

I am trying to remove the repetitive/duplicate Names which is coming under NAME column. I just want to keep the 1st occurrence from the repetitive/duplicate names by using python script.

This is my input excel:

input

And need output like this:

output

sonali
  • 69
  • 7
  • 1
    Please don't post image data, you need to provide text so others can test your data in their editors to see what works. Check out asking a good question in pandas https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Umar.H Jul 06 '19 at 16:23
  • 1
    Why do you want to do that in pandas? is your sheet large? If it is not that large, you can easily do that with excel formulas (like placeing `IF(A2=A1;"":A2)` in a column behind your data, copying the formula down then copy the whole column and insert the values calculated by the fromula into column A. You could record this procedure as a macro, so you can reexecute it any time. – jottbe Jul 06 '19 at 16:26
  • My data is very large. Its not this much small and i want to do with python script to merge with other codes or stuffs. @jottbe – sonali Jul 06 '19 at 16:49
  • I see. The size alone wouldn't be an issue. I mean, you get your data in an excel sheet, so you could process it there, but if you do other stuff with the excel using python anyways, that's a point. – jottbe Jul 06 '19 at 18:12

1 Answers1

3

This isn't removing duplicates per say you're just filling duplicate keys in one column as blanks, I would handle this as follows :

by creating a mask where you return a true/false boolean if the row is == the row above.

assuming your dataframe is called df

mask = df['NAME'].ne(df['NAME'].shift())

df.loc[~mask,'NAME'] = ''

explanation :

what we are doing above is the following,

first selecting a single column, or in pandas terminology a series, we then apply a .ne (not equal to) which in effect is !=

lets see this in action.

import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']

now, lets create a dataframe similar to yours.

df = pd.DataFrame({'NAME' : names,
         'DEFAULT' : defaults,
         'CLASS' : classes,
         'AGE' : [np.random.randint(1,5) for len in names], 
         'GROUP' : [np.random.randint(1,5) for len in names]}) # being lazy with your age and group variables. 

so, if we did df['NAME'].ne('Omar') which is the same as [df['NAME'] != 'Omar'] we would get.

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True

so, with that out of the way, we want to see if the name in row 1 (remember python is a 0 index language so row 1 is actually the 2nd physical row) is .eq to the row above.

we do this by calling [.shift][2] hyperlinked for more info.

what this basically does is shift the rows by its index with a defined variable number, lets call this n.

if we called df['NAME'].shift(1)

0       NaN
1     Rekha
2     Rekha
3      Jaya
4      Jaya
5    Sushma
6      Nita
7      Nita

we can see here that that Rekha has moved down

so putting that all together,

df['NAME'].ne(df['NAME'].shift())
0     True
1    False
2     True
3    False
4     True
5     True
6    False
7    False

we assign this to a self defined variable called mask you could call this whatever you want.

we then use [.loc][2] which lets you access your dataframe by labels or a boolean array, in this instance an array.

however, we only want to access the booleans which are False so we use a ~ which inverts the logic of our array.

    NAME    DEFAULT CLASS   AGE GROUP
1   Rekha       third   1   4
3   Jaya        fifth   1   1
6   Nita        fifth   1   2
7   Nita        fourth  1   4

all we need to do now is change these rows to blanks as your initial requirment, and we are left with.

    NAME    DEFAULT   CLASS AGE GROUP
0   Rekha             forth 2   2
1                     third 1   4
2   Jaya    c-default forth 3   3
3                     fifth 1   1
4   Sushma            fourth3   1
5   Nita    c-default third 4   2
6                     fifth 1   2
7                     fourth1   4

hope that helps!

Umar.H
  • 22,559
  • 7
  • 39
  • 74
  • awesome. its working. can you please explain this in brief.. means what is .ne, .loc, ~mask. – sonali Jul 06 '19 at 16:41
  • @srish will do as soon as I'm home ! Thanks for your green tick . – Umar.H Jul 06 '19 at 17:55
  • 1
    @srish edited for you, hope that helps. Happy coding! – Umar.H Jul 06 '19 at 19:21
  • Its really very good explanation. Thank you so much for your time and this explanation, it helped me alot. If its in my hand i must have given you more than one green tick and upvotes :-) :-D – sonali Jul 06 '19 at 20:15
  • 1
    you are most welcome, and don't worry, I learnt a lot by reading the docs to make the answer. Good luck, and make sure you provide a textual dataframe (like the one I gave above) if you ask another question. – Umar.H Jul 06 '19 at 20:28
  • yes i will remember to use textual dataframe when i ask another question. Thanks:-) – sonali Jul 06 '19 at 20:37