1

I have real estate properties and their details (17 columns) in a CSV file (nearly half a million entries). One of the columns provides a location but is actually somewhat a bit too detailed. I want to categorize my entries so I want to simplify the location to give me more generic areas. I would have the areas I want to categorize the entries into in a list such as:

keywords = ['Downtown','Park View','Industrial District', ... ]

So ideally I would like to take an entry that has for example Sky Tower Downtown Los Angeles and then classify it as Downtown.

So the task is to first detect the keyword in the location column and then append it to a new column (right beside it if possible). If no keyword is found in the entry, I would to classify it as Other.

It would look something like this:

Date Record_Type Location Proterty_Type ... Price
19-Mar-21 Active Listing Sky Tower Downtown Los Angeles Apartment ... 15000
19-Mar-21 Active Listing Central Park Residential Tower, 5th Avenue Apartment ... 17000
20-Mar-21 Active Listing Meadow Gardens, Park View Villa ... 125000

To something like:

Date Record_Type Location Area Proterty_Type ... Price
19-Mar-21 Active Listing Sky Tower Downtown Los Angeles Downtown Apartment ... 15000
19-Mar-21 Active Listing Central Park Residential Tower, 5th Avenue Other Apartment ... 17000
20-Mar-21 Active Listing Meadow Gardens, Park View Park View Villa ... 125000

Finally it saves it all to a new csv file. I would also ideally like yo use pandas to read/write on the csv.

Thanks in advance!

Edit: I have tried methods such as the following threads, but I get errors and I don't know whats wrong, so Im open to fresh ideas.

How to append a new column to a CSV file using Python?

Adding new column to CSV in Python

Mahmood
  • 45
  • 8
  • How would you classify for example `Meadow Gardens, Park View, Downtown`? Park View or Downtown? – Andrej Kesely Apr 05 '21 at 21:51
  • This is far too broad. What have you tried, and what went wrong with your attempts? For example, pandas has the `read_csv()` and `Series.str.contains()` methods which seem a good place to start – G. Anderson Apr 05 '21 at 21:54
  • Ideally, most cases do not have multiple keywords in their string, but if I do< would take the last keyword. If that is tricky, either the first keyword detected or classify it as `Other` and I can manually check it later. – Mahmood Apr 05 '21 at 21:55
  • @G.Anderson I am using `df = pd.read_csv(r'listings.csv', names=col_names, skiprows=[0])` to import the data and `col_names` is from a list I defined before importing. – Mahmood Apr 05 '21 at 21:56
  • 1
    That would be good detail to [edit] into your question to make it a [mcve]. The more specific you can be with where you are now, the better we can help you – G. Anderson Apr 05 '21 at 22:03
  • Sure, ill be sure to edit it now. – Mahmood Apr 05 '21 at 22:07

1 Answers1

1

If you have this datafame:

        Date     Record_Type                                    Location Proterty_Type   Price
0  19-Mar-21  Active Listing              Sky Tower Downtown Los Angeles     Apartment   15000
1  19-Mar-21  Active Listing  Central Park Residential Tower, 5th Avenue     Apartment   17000
2  20-Mar-21  Active Listing                   Meadow Gardens, Park View         Villa  125000

Then:

keywords = ["Downtown", "Park View", "Industrial District"]

df.insert(
    loc=3,
    column="Area",
    value=df["Location"].apply(
        lambda x: next((kw for kw in keywords if kw in x), "Other")
    ),
)
print(df)

Creates Area column next to Location and prints:

        Date     Record_Type                                    Location       Area Proterty_Type   Price
0  19-Mar-21  Active Listing              Sky Tower Downtown Los Angeles   Downtown     Apartment   15000
1  19-Mar-21  Active Listing  Central Park Residential Tower, 5th Avenue      Other     Apartment   17000
2  20-Mar-21  Active Listing                   Meadow Gardens, Park View  Park View         Villa  125000
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • 1
    Works perfectly, thank you for that!, any place where you would recommend i check the documenntation for `lambda`? – Mahmood Apr 06 '21 at 07:02
  • 1
    @MahmoudAlQadi `lambda` is standard Python syntax, you can start at official Python docs: https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions – Andrej Kesely Apr 06 '21 at 07:10
  • trying this on a separate dateset, i got an error `TypeError: argument of type 'float' is not iterable` what could be causing this? How would I make it skip floats? – Mahmood Apr 06 '21 at 08:02
  • 1
    @MahmoudAlQadi Shouldn't happen, is your column of type `string`? you can do before `df['Location'] = df['Location'].astype(str)` – Andrej Kesely Apr 06 '21 at 08:04
  • they are strings but as there are so many values I haven't manually checked it all. your fix works tho, thank you! – Mahmood Apr 06 '21 at 08:07