Keyword Detection in CSV and add to new column

Question

I have real estate properties and their details (17 columns) in a CSV file (nearly half a million entries). One of the columns provides a location but is actually somewhat a bit too detailed. I want to categorize my entries so I want to simplify the location to give me more generic areas. I would have the areas I want to categorize the entries into in a list such as:

keywords = ['Downtown','Park View','Industrial District', ... ]

So ideally I would like to take an entry that has for example Sky Tower Downtown Los Angeles and then classify it as Downtown.

So the task is to first detect the keyword in the location column and then append it to a new column (right beside it if possible). If no keyword is found in the entry, I would to classify it as Other.

It would look something like this:

Date	Record_Type	Location	Proterty_Type	...	Price
19-Mar-21	Active Listing	Sky Tower Downtown Los Angeles	Apartment	...	15000
19-Mar-21	Active Listing	Central Park Residential Tower, 5th Avenue	Apartment	...	17000
20-Mar-21	Active Listing	Meadow Gardens, Park View	Villa	...	125000

To something like:

Date	Record_Type	Location	Area	Proterty_Type	...	Price
19-Mar-21	Active Listing	Sky Tower Downtown Los Angeles	Downtown	Apartment	...	15000
19-Mar-21	Active Listing	Central Park Residential Tower, 5th Avenue	Other	Apartment	...	17000
20-Mar-21	Active Listing	Meadow Gardens, Park View	Park View	Villa	...	125000

Finally it saves it all to a new csv file. I would also ideally like yo use pandas to read/write on the csv.

Thanks in advance!

Edit: I have tried methods such as the following threads, but I get errors and I don't know whats wrong, so Im open to fresh ideas.

How to append a new column to a CSV file using Python?

Adding new column to CSV in Python

How would you classify for example `Meadow Gardens, Park View, Downtown`? Park View or Downtown? — Andrej Kesely, Apr 05 '21 at 21:51
This is far too broad. What have you tried, and what went wrong with your attempts? For example, pandas has the `read_csv()` and `Series.str.contains()` methods which seem a good place to start — G. Anderson, Apr 05 '21 at 21:54
Ideally, most cases do not have multiple keywords in their string, but if I do< would take the last keyword. If that is tricky, either the first keyword detected or classify it as `Other` and I can manually check it later. — Mahmood, Apr 05 '21 at 21:55
@G.Anderson I am using `df = pd.read_csv(r'listings.csv', names=col_names, skiprows=[0])` to import the data and `col_names` is from a list I defined before importing. — Mahmood, Apr 05 '21 at 21:56
That would be good detail to [edit] into your question to make it a [mcve]. The more specific you can be with where you are now, the better we can help you — G. Anderson, Apr 05 '21 at 22:03

score 1 · Accepted Answer · answered Apr 05 '21 at 22:08

1

If you have this datafame:

        Date     Record_Type                                    Location Proterty_Type   Price
0  19-Mar-21  Active Listing              Sky Tower Downtown Los Angeles     Apartment   15000
1  19-Mar-21  Active Listing  Central Park Residential Tower, 5th Avenue     Apartment   17000
2  20-Mar-21  Active Listing                   Meadow Gardens, Park View         Villa  125000

Then:

keywords = ["Downtown", "Park View", "Industrial District"]

df.insert(
    loc=3,
    column="Area",
    value=df["Location"].apply(
        lambda x: next((kw for kw in keywords if kw in x), "Other")
    ),
)
print(df)

Creates Area column next to Location and prints:

        Date     Record_Type                                    Location       Area Proterty_Type   Price
0  19-Mar-21  Active Listing              Sky Tower Downtown Los Angeles   Downtown     Apartment   15000
1  19-Mar-21  Active Listing  Central Park Residential Tower, 5th Avenue      Other     Apartment   17000
2  20-Mar-21  Active Listing                   Meadow Gardens, Park View  Park View         Villa  125000

answered Apr 05 '21 at 22:08

Andrej Kesely

168,389
15
48
91

1

Works perfectly, thank you for that!, any place where you would recommend i check the documenntation for `lambda`? – Mahmood Apr 06 '21 at 07:02
1

@MahmoudAlQadi `lambda` is standard Python syntax, you can start at official Python docs: https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions – Andrej Kesely Apr 06 '21 at 07:10
trying this on a separate dateset, i got an error `TypeError: argument of type 'float' is not iterable` what could be causing this? How would I make it skip floats? – Mahmood Apr 06 '21 at 08:02
1

@MahmoudAlQadi Shouldn't happen, is your column of type `string`? you can do before `df['Location'] = df['Location'].astype(str)` – Andrej Kesely Apr 06 '21 at 08:04
they are strings but as there are so many values I haven't manually checked it all. your fix works tho, thank you! – Mahmood Apr 06 '21 at 08:07

Keyword Detection in CSV and add to new column

1 Answers1