I have a data frame with multiple columns, one of which contains strings separated by spaces; these strings are titles for property listings and contain both upper- and lower-case words. I'm trying to write a for loop or a list comprehension, using Python's regex module (re), that iterates over the strings and returns either a Boolean (True/False) or a category name, based on a defined list of search words. Finally, I'd like to store the result in a new column in the data frame.
Here's a minimal example of my data frame:
import pandas as pd

data = {'id': [748, 896, 5268],
        'name': ['Bright, Modern Garden Unit - 1BR/1BTH', 'Renovated Alamo Square Victorian', 'Mission Sunny, near Park'],
        'price': [209, 255, 180]}
df = pd.DataFrame(data)
print(df)
This is what it produces:
     id                                   name  price
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209
1   896       Renovated Alamo Square Victorian    255
2  5268               Mission Sunny, near Park    180
This is what I want to get for the Boolean output:
     id                                   name  price  amenities_bool
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209            True
1   896       Renovated Alamo Square Victorian    255            True
2  5268               Mission Sunny, near Park    180            True
This is what I want to get for the specified categorical output:
     id                                   name  price  amenities_bool  \
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209            True
1   896       Renovated Alamo Square Victorian    255            True
2  5268               Mission Sunny, near Park    180            True

   amenities_descp
0           bright
1        renovated
2             near
What I've done so far:
I used this code to search the string column for one specific word at a time (note: df_deep_2 is my original data frame, not the minimal example provided above):
df_deep_2[df_deep_2['name'].str.contains('modern', regex=True, flags=re.IGNORECASE)].shape
It returns:
(13267, 21)
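A minimal sketch of the same single-word search, applied to the minimal example above rather than df_deep_2 (the names mask and matches are illustrative only):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'id': [748, 896, 5268],
    'name': ['Bright, Modern Garden Unit - 1BR/1BTH',
             'Renovated Alamo Square Victorian',
             'Mission Sunny, near Park'],
    'price': [209, 255, 180],
})

# Boolean mask: True where the title contains "modern", ignoring case
mask = df['name'].str.contains('modern', regex=True, flags=re.IGNORECASE)

# Indexing with the mask keeps only the matching rows;
# .shape then reports (number of matching rows, number of columns)
matches = df[mask]
print(matches.shape)
```

Here .str.contains returns one True/False value per row, so the mask itself could already serve as a Boolean column for a single search word.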
I'd like to use something like the following examples to achieve this, but I don't know any regular-expression syntax beyond what I've shown above:
For Boolean output, for example:
amenities_descp = ['parking', 'free', 'air', 'wifi', 'pool', 'hot tub', 'close', 'garden', 'bright', 'luxury', 'renovated', 'modern', 'green', 'near', 'convenient']
df['amenities_bool'] = False # default value
for index, row in df.iterrows():
    if row['name'] in amenities_descp:
        df.at[index, 'amenities_bool'] = True
For the specified categorical output, for example:
amenities_spec = {'parking': 'parking', 'free': 'free', 'air': 'air', 'wifi': 'wifi', 'pool': 'pool', 'hot tub': 'hot tub', 'close': 'close', 'garden': 'garden', 'bright': 'bright', 'luxury': 'luxury', 'renovated': 'renovated', 'modern': 'modern', 'green': 'green', 'near': 'near', 'convenient': 'convenient'}
df['amenities_type'] = [amenities_spec[amenity] if amenity in amenities_spec else 'None' for amenity in df['name']]
Where I'm stuck is how/where to incorporate regular expression syntax; the closest I've gotten is the following (Note: df_deep_copy is a copy of df_deep_2):
df_deep_copy['amenities_bool'] = [True if amenity in amenities_descp else False for amenity in df_deep_copy[df_deep_copy['name'].str.contains(amenities_descp, regex=True, flags=re.IGNORECASE)]]
This raises TypeError: unhashable type: 'list'. I realize the issue is with the first argument to .str.contains: it seems you can't pass a list as the pattern, but I'm stumped on what to use instead. This is the closest related question I've found:
Search for a word in a DataFrame column and ignore regex and substring
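One possible direction, sketched under assumptions rather than offered as a definitive answer: since .str.contains expects a single pattern string, the search words can be joined with | into one regular-expression alternation. The \b word boundaries, the use of re.escape, and the choice to report only the first matching word (lower-cased) are assumptions, not requirements stated above. The names alternation, contains_pattern and extract_pattern are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'id': [748, 896, 5268],
    'name': ['Bright, Modern Garden Unit - 1BR/1BTH',
             'Renovated Alamo Square Victorian',
             'Mission Sunny, near Park'],
    'price': [209, 255, 180],
})

amenities_descp = ['parking', 'free', 'air', 'wifi', 'pool', 'hot tub',
                   'close', 'garden', 'bright', 'luxury', 'renovated',
                   'modern', 'green', 'near', 'convenient']

# Join the words into "parking|free|air|...". re.escape protects any
# characters that would otherwise have a special meaning in a regex.
alternation = '|'.join(re.escape(word) for word in amenities_descp)

# Assumption: match whole words only, so "air" does not match "fair".
# Dropping the \b markers would give plain substring matching instead.
contains_pattern = rf'\b(?:{alternation})\b'   # non-capturing group
extract_pattern = rf'\b({alternation})\b'      # capturing group for extract

# Boolean column: True if any search word occurs in the title
df['amenities_bool'] = df['name'].str.contains(
    contains_pattern, regex=True, flags=re.IGNORECASE)

# Categorical column: the first search word found in the title,
# lower-cased; titles with no match get NaN
df['amenities_descp'] = df['name'].str.extract(
    extract_pattern, flags=re.IGNORECASE, expand=False).str.lower()

print(df)
```

A non-capturing group is used for .str.contains because pandas warns when a pattern passed to it contains capture groups, while .str.extract requires at least one capture group to know what to return.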