Unable to delete the duplicates in CSV

Question

I have a data set in CSV with a field named Episode, where we store data for future sports events. It contains both "INDIA VS PAKISTAN" and "PAKISTAN VS INDIA" for the same date. Is there any way to delete the duplicate?

Thanks in advance

enter image description here

Welcome to stack overflow! Unfortunately, this is not a code-writing or tutorial site, and we ask that you provide a [mcve] including sample input and output (as text in the question, not as a picture) and _code for what you've tried_ based on your own research. Please see [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for help on the input and output — G. Anderson, Nov 15 '19 at 20:11
[df.drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) — It_is_Chris, Nov 15 '19 at 20:18
In the last two lines, the text is not even the same in terms of words, as `Unlv Rebels` appears in the line before the last and `Unlvrebels` in the last one. You should clean up your dataset first, and then proceed to drop the duplicates. — Celius Stingher, Nov 15 '19 at 20:21
@Chris would `drop_duplicates` work here because he does have a column to the left that is unique, just with flipped text. — William Knighting, Nov 15 '19 at 20:21
Why does the data presented not match the terms in the question? We are to assume things are in a DF? The question needs to be restructured significantly to prove useful for future SO users. — Shawn Mehan, Nov 15 '19 at 20:23
@WilliamKnighting sort the characters in the string and drop the dups on that — It_is_Chris, Nov 15 '19 at 21:12

score 1 · Answer 1 · answered Nov 15 '19 at 20:19

One idea would be to use the pandas `rank` method, grouping by the needed columns:

df["RANK"] = df.groupby("Column_1")["Column_2"].rank(method="first", ascending=True)

This ranks the rows within each group, so three duplicate rows should be ranked 1, 2 and 3 respectively. From there, you can take the subset of the dataframe where `RANK == 1`, which gives you a dataframe with no dupes.
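As a rough illustration of the ranking idea, here is a minimal sketch. The sample values are made up, `Column_1` and `Column_2` are the placeholder names from the line above, and a numeric `Column_2` is assumed so that `method="first"` ranks ties by order of appearance.

```python
import pandas as pd

# Hypothetical sample: three duplicate rows for "x", two for "y"
df = pd.DataFrame({
    "Column_1": ["x", "x", "x", "y", "y"],
    "Column_2": [10, 10, 10, 7, 7],
})

# Rank rows within each Column_1 group; ties are broken by row order
df["RANK"] = df.groupby("Column_1")["Column_2"].rank(method="first", ascending=True)

# Keep only the first-ranked row of each group, then drop the helper column
deduped = df[df["RANK"] == 1].drop(columns="RANK")
print(deduped)
```

Note that this only removes rows that already share the same grouping key, so flipped text such as "A at B" and "B at A" would still need to be normalized into a common key first.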

It_is_Chris · Accepted Answer · 2019-11-15T21:20:40.137

0

Create a new `match` column, then call `drop_duplicates`:

import pandas as pd

# sample df
df = pd.DataFrame({'a': [1,1,1,1,1],
                   'b': ['Bulldogs at Aztecs', 'Aztecs at Bulldogs', 'Bearcats at Huskies', 'Huskies at Bearcats', 'something else']})

# list comprehension and sort words in string 
df['match'] = [' '.join(sorted(x.split())) for x in df['b'].values]

#    a                    b                match
# 0  1   Bulldogs at Aztecs   Aztecs Bulldogs at
# 1  1   Aztecs at Bulldogs   Aztecs Bulldogs at
# 2  1  Bearcats at Huskies  Bearcats Huskies at
# 3  1  Huskies at Bearcats  Bearcats Huskies at
# 4  1       something else       else something

# drop_duplicates
df.drop_duplicates(['a', 'match'], keep='first').drop(columns='match')

#    a                    b
# 0  1   Bulldogs at Aztecs
# 2  1  Bearcats at Huskies
# 4  1       something else

edited Nov 15 '19 at 21:20

answered Nov 15 '19 at 21:10

It_is_Chris

Hi Chris, could you please explain what happens on the second line? – Renganathan Rajagopal Nov 16 '19 at 21:12
@RengaNathan Sure, it is a list comprehension. For every value in the column `df['b']`, you use `split`, which splits each string on whitespace and creates a list of all the words in the string, for example `['Bulldogs', 'at', 'Aztecs']`. Then you call the built-in `sorted` function on that, which sorts the strings in the new list. Then you `join` the strings in the now-sorted list together to form a new string, and assign the values to a new column. – It_is_Chris Nov 16 '19 at 21:17
For more clarification, you can run these three lines separately to see what is going on: `'Bulldogs at Aztecs'.split()` then `sorted('Bulldogs at Aztecs'.split())` then `' '.join(sorted('Bulldogs at Aztecs'.split()))` – It_is_Chris Nov 16 '19 at 21:23
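Applied to the data described in the question, the same split/sort/join steps might look like the sketch below. The `Episode` field name comes from the question; the `Date` column name, the sample dates, the helper `match_key`, and the lowercasing step are assumptions added for illustration.

```python
import pandas as pd

def match_key(text: str) -> str:
    """Build an order-insensitive key: lowercase, split into words, sort, rejoin."""
    return " ".join(sorted(text.lower().split()))

# Hypothetical sample mirroring the question's description
df = pd.DataFrame({
    "Date": ["2019-11-15", "2019-11-15", "2019-11-16"],
    "Episode": ["INDIA VS PAKISTAN", "PAKISTAN VS INDIA", "INDIA VS PAKISTAN"],
})

# Helper column holding the normalized key for each episode
df["key"] = df["Episode"].map(match_key)

# Rows count as duplicates only when both the date and the key match
deduped = df.drop_duplicates(subset=["Date", "key"], keep="first").drop(columns="key")
print(deduped)
```

Here the two rows for 2019-11-15 collapse into one, while the 2019-11-16 row is kept because its date differs. Spelling differences such as `Unlv Rebels` versus `Unlvrebels` are not handled by this key and would need separate cleanup.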
