Dropping duplicate records ignoring case

Question

I have a data frame with with several columns, one of which is company_name. I'm trying to remove duplicate records based on them having the same company_name, but I'm at a loss on how to do this while maintaining the original case of the company_name.

I've tried converting all names to lowercase and then removing duplicates, but as mentioned before I would like to maintain the original case.

df = pd.DataFrame({'company_name': ['Apple', 'apple', 'BlackBerry', 'blackberry','Blackberry'])
df['company_name'] = df['company_name'].str.strip()
df['company_name'] = df['company_name'].str.lower()
input_records.drop_duplicates(subset=['company_id'], inplace=True, 
   keep="first")

this code gets rid of duplicates, but stores the company name in all lowercase.

Desired result

company_name
Apple
BlackBerry

Actual result

company_name
apple
blackberry

Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on how to ask a good question may also be useful. — yatu, May 31 '19 at 14:06
Possible duplicate of [Remove duplicate rows from a pandas dataframe: Case Insenstive comparison](https://stackoverflow.com/questions/56125863/remove-duplicate-rows-from-a-pandas-dataframe-case-insenstive-comparison) — micric, May 31 '19 at 14:11

vlemaistre · Accepted Answer · 2019-05-31T15:13:50.163

5

You could create a new column where you do your operations. This will allow you to keep the company name column untouched.

Here is what it could look like :

df = pd.DataFrame({'company_name':['Apple','apple','Blackberry','blackberry']})

company_name
0        Apple
1        apple
2   Blackberry
3   blackberry

df['company_name_2'] = df['company_name'].str.strip()
df['company_name_2'] = df['company_name_2'].str.lower()
df.drop_duplicates(subset=['company_name_2'], inplace=True, 
   keep="first")
df.drop(columns = 'company_name_2', inplace = True)
df.reset_index(inplace = True, drop = True)

Out :

    company_name
0   Apple
1   Blackberry

edited May 31 '19 at 15:13

answered May 31 '19 at 14:15

vlemaistre

3,301
13
30

1

Thank you very much! This worked perfectly. I did change one thing however and that was changing df['company_name'].str.lower() to df['company_name_2].str.lower() – DataScience99 May 31 '19 at 14:29
You're right, it was a mistake from my part. I edited it, thanks for pointing it out – vlemaistre May 31 '19 at 15:14

score 3 · Answer 2 · answered May 31 '19 at 14:19

3

How about using duplicated

yourdf=df[~df.company_name.str.strip().str.lower().duplicated(keep='first')]
yourdf
Out[415]: 
  company_name
0        Apple
2   BlackBerry

answered May 31 '19 at 14:19

BENY

317,841
20
164
234

Dropping duplicate records ignoring case

2 Answers2