I'm scraping property ads with BS4 and using pandas to analyse the data.

In my DataFrame, rows represent property ads and columns represent property characteristics like rent, size, district, etc.

In a few property ads, the district names are incorrectly spelled, or even missing entirely. I would like to drop those property ads, i.e. I would like to drop the rows for which the district name is misspelled or missing.

I have a list containing the correct district names, e.g.

correct_districts=['North', 'South', 'West', 'East']

and I have a DataFrame city_df with, among other columns, a District column, e.g.

|  District | ....
 -----------------
|   North   | ....
|   South   | ....
|   Nort    | ....
|           | ....
|   West    | ....
|   ....    | ....

Following this answer on conditional row selection, I tried this:

city_df=city_df.loc[~city_df['District'].isin(correct_districts)]

However, this does not change anything in the District column. If I remove the ~ and run the command, I am left with only the rows for which the district name is missing.

What should I change to remove the rows for which the district names are either missing or misspelled?

LucSpan
  • city_df = city_df.loc[city_df['District'].isin(correct_districts)] works fine. Maybe you executed the code with ~, which removed all the rows with correct districts from city_df. Try reloading city_df; it should work. – Vaishali Mar 19 '17 at 19:24
  • Thank you for the confirmation! I checked my correct_districts list and found I had overlooked some trailing whitespace... Hence the strange dropping behaviour. It works now :) – LucSpan Mar 19 '17 at 19:36
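
For reference, the fix discussed in the comments could look roughly like the sketch below. The sample rows, the Rent column, and the names clean_districts and filtered_df are made up for illustration. Stripping whitespace from the District column as well as from the list is an extra precaution not mentioned in the thread. The idea is to normalise the strings first and then keep the rows where isin() is True, i.e. without the ~.

```python
import pandas as pd

# Hypothetical sample data resembling the table in the question
city_df = pd.DataFrame({
    "District": ["North", "South ", "Nort", None, " West", ""],
    "Rent": [900, 1100, 950, 1000, 1200, 800],
})

# A reference list with stray whitespace, like the one described in the comments
correct_districts = ["North ", "South", "West", " East"]

# Strip surrounding whitespace on both sides before comparing
clean_districts = [name.strip() for name in correct_districts]
city_df["District"] = city_df["District"].str.strip()

# isin() is False for missing values and unknown spellings,
# so selecting on the mask itself (no ~) drops exactly those rows
mask = city_df["District"].isin(clean_districts)
filtered_df = city_df.loc[mask].reset_index(drop=True)

print(filtered_df)
```

Prefixing the mask with ~ inverts it, which would keep only the misspelled and missing rows instead.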

0 Answers