I have a dataframe in which one of the columns is a city name. To check whether I have duplicate values, I run df_hotels['city_name'].value_counts().sort_values(). When I display the results, I can see that I have duplicate values because of an empty character on the left of some city names. (Normally I should have a count of 25 for each line.)

The problem is that when I try df_hotels['city_name'] = df_hotels['city_name'].str.strip() (or lstrip), it doesn't work: the empty character on the left is still there.

FYI, to give some context: the column dtype is object, and I created the dataframe from a JSON file with a simple pd.read_json.

Thanks for your help.

valskyyy
  • Does this answer your question? [Pandas - Strip white space](https://stackoverflow.com/questions/43332057/pandas-strip-white-space) – Ture Pålsson Sep 24 '21 at 07:29

1 Answer

1

You can use the drop_duplicates function to remove duplicates, as explained in the documentation (link).

If you want to apply a function to a column using pandas, you can use the apply method, in some cases together with a lambda function. Here is an example:

df_hotels['city_name'] = df_hotels['city_name'].apply(lambda x: x.strip())
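
As a minimal sketch of the intended result (the sample data below is made up for illustration and is not the asker's dataset), stripping the column and then counting again should merge " Paris" and "Paris" into a single entry. The vectorized .str.strip() is used here; for a column that contains only strings, apply(lambda x: x.strip()) should give the same result.

```python
import pandas as pd

# Hypothetical sample data: 10 clean values and 15 with a leading space
df_hotels = pd.DataFrame({"city_name": ["Paris"] * 10 + [" Paris"] * 15})

# Before stripping, the two spellings are counted separately
before = df_hotels["city_name"].value_counts()

# Strip leading and trailing whitespace from every value in the column
df_hotels["city_name"] = df_hotels["city_name"].str.strip()

# After stripping, both spellings fall under the same key
after = df_hotels["city_name"].value_counts()
print(after)
```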
Guyblublu
  • Hi ! Thanks but I don't want to drop the duplicates, I need to keep them. If I have 10 "Paris" and 15 " Paris", I want 25 "Paris". I tried the apply method but the empty space is still there... – valskyyy Sep 24 '21 at 08:23
  • I don't know why it didn't work for you, you are welcome to send the data set and I'll try to help. – Guyblublu Sep 24 '21 at 08:42
  • here is the link to the csv, exported from the dataframe : https://drive.google.com/file/d/1cLDetz00W4JKykPCCWvbJpSbnkVsGLI6/view?usp=sharing – valskyyy Sep 24 '21 at 09:59
  • This will work for you:
        import pandas as pd
        df_hotels = pd.read_csv('data.csv')
        df_hotels['city_name'] = df_hotels['city_name'].astype('string')  # change the type of the column
        print(df_hotels.loc[df_hotels['city_name'].apply(lambda x: x.startswith(' '))])  # print where city_name starts with ' '
        df_hotels['city_name'] = df_hotels['city_name'].apply(lambda x: x.strip())  # remove the ' '
        print(df_hotels.loc[df_hotels['city_name'].apply(lambda x: x.startswith(' '))])  # print where city_name starts with ' '
    – Guyblublu Sep 24 '21 at 10:16
  • Works perfectly ! thanks a lot @Guyblublu ! – valskyyy Sep 24 '21 at 20:53
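
The thread never establishes why the original .str.strip() call appeared to have no effect. One way to investigate a case like this is to look at the raw characters with repr(), which makes invisible leading characters visible. The sketch below uses made-up data; the zero-width space (U+200B) is only an assumed example of a character that strip() leaves in place, not something confirmed for the asker's file.

```python
import pandas as pd

# Hypothetical data: one value with a plain leading space,
# one with a zero-width space (U+200B), which strip() does not remove
df_hotels = pd.DataFrame({"city_name": ["Paris", " Paris", "\u200bLyon", "Lyon"]})

# repr() shows escape sequences instead of rendering invisible characters
visible = df_hotels["city_name"].map(repr).unique().tolist()
print(visible)

# Values that strip() actually changes
stripped = df_hotels["city_name"].str.strip()
changed = df_hotels.loc[df_hotels["city_name"] != stripped, "city_name"].tolist()

# Values that still start with a non-letter after stripping
still_odd = stripped[~stripped.str[0].str.isalpha()].tolist()
```

If still_odd is not empty, the leftover character can be identified from the repr() output and removed explicitly, for example with .str.replace.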