0

I have a data frame with multiple columns, one of which contains strings separated by spaces -- these strings are titles for property listings and have upper and lower case words. I'm trying to write a for loop and a list comprehension, using Python regex module (re) that will iterate over the strings and return either a Boolean (True/False) or a categorical name, based on a defined list of search words. Lastly, I'd like to output this in a new column in the data frame.

Here's a minimum example of my data frame:

data = {'id': [748, 896, 5268],
        'name' : ['Bright, Modern Garden Unit - 1BR/1BTH', 'Renovated Alamo Square Victorian', 'Mission Sunny, near Park'],
        'price': [209, 255, 180]}
df = pd.DataFrame(data)
print(df)

This is what it produces:

     id                                   name  price
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209
1   896       Renovated Alamo Square Victorian    255
2  5268               Mission Sunny, near Park    180

This is what I want to get for the Boolean output:

     id                                   name  price  amenities_bool
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209            True
1   896       Renovated Alamo Square Victorian    255            True
2  5268               Mission Sunny, near Park    180            True

This is what I want to get for the specified categorical output:

     id                                   name  price  amenities_bool  \
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209            True   
1   896       Renovated Alamo Square Victorian    255            True   
2  5268               Mission Sunny, near Park    180            True   

  amenities_descp  
0          bright  
1       renovated  
2            near 

What I've done so far:

I used this code to search for specific words in the string column individually(Note: df_deep_2 is my original df, not the minimal example provided above):

df_deep_2[df_deep_2['name'].str.contains('modern', regex=True, flags=re.IGNORECASE)].shape

It returns:

(13267, 21)

I'd like to use something like the following examples to achieve said goal, but I don't know syntax or characters for regular expressions beyond what I've demonstrated already:

For Boolean output, for example:

amenities_descp = ['parking', 'free', 'air', 'wifi', 'pool', 'hot tub', 'close', 'garden', 'bright', 'luxury', 'renovated', 'modern', 'green', 'near', 'convenient']

df['amenities_bool'] = False  # default value

for index, row in df.iterrows():
    if row['name'] in amenities_descp:
        df.at[index, 'amenities_bool'] = True

For the specified categorical output, for example:

 amenities_spec = {'parking': 'parking', 'free': 'free', 'air': 'air', 'wifi': 'wifi', 'pool': 'pool', 'hot tub': 'hot tub', 'close': 'close', 'garden': 'garden', 'bright': 'bright', 'luxury': 'luxury', 'renovated': 'renovated', 'modern': 'modern', 'green': 'green', 'green': 'green', 'convenient': 'convenient'}

df['amenities_type'] = [amenities_spec[amenity] if amenity in amenities_spec else 'None' for amenity in df['name']]

Where I'm stuck is how/where to incorporate regular expression syntax; the closest I've gotten is the following (Note: df_deep_copy is a copy of df_deep_2):

df_deep_copy['amenities_bool'] = [True if amenity in amenities else False for amenity in df_deep_copy[df_deep_copy['name'].str.contains(amenities_desc, regex=True, flags=re.IGNORECASE)]]

This results in a type error of unhashable type: list. I realize the issue is with the first argument after .str.contains -- it seems you can't use a list as an input, but I'm stumped on what other function I should use to achieve this. This is the closest I've found:

Search for a word in a DataFrame column and ignore regex and substring

  • If I'm following this correctly, you're saying you don't know regex, so what do you need our help with? We don't provide tutorials here, if that's what you were thinking. But there are lots of other resources you could use, like the official [Python regex howto](https://docs.python.org/3/howto/regex.html) for example. – wjandrea Aug 10 '23 at 20:57
  • Otherwise, the question is pretty unclear IMHO. You might want to take a step back and rethink what you want to ask. I guess essentially you're trying to ask how to convert a list to regex, but there are existing questions about that, like [how to do re.compile() with a list](/q/6750240/4518341). And you might want to rethink your methods, cause with Pandas, [you shouldn't be iterating manually](/a/55557758/4518341) and there's no need to use `re` when Pandas has regex built-in. – wjandrea Aug 10 '23 at 20:59
  • Lastly, the example has a few problems: 1) `amenities_spec` has invalid syntax. 2) some names that aren't defined: `df_deep_2`, `df_deep_copy`, `amenities_desc`. 3) The `shape` is way larger than the example, so that's probably the result from your real-life df. – wjandrea Aug 10 '23 at 21:01
  • FWIW, it looks like `df['amenities_bool'] == df['amenities_descp'].notna()`, so that could be a useful shortcut, rather than processing all the names twice. – wjandrea Aug 10 '23 at 21:15
  • @misterhuge Thank you for response -- the solution you suggest worked very well for the Boolean operation and I was able to use a vectorized operation to produce a new column with searched words. – new_to_code Aug 11 '23 at 18:30

1 Answers1

0

I think you could use Series.str.contains(pat) which accepts a regex like so:

data = {'id': [748, 896, 5268],
                'name' : ['Bright, Modern Garden Unit - 1BR/1BTH', 'Renovated Alamo Square Victorian', 'Mission Sunny, near Park'],
                'price': [209, 255, 180]}
df = pd.DataFrame(data)
print(df)

     id                                   name  price
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209
1   896       Renovated Alamo Square Victorian    255
2  5268               Mission Sunny, near Park    180

amenities_descp = ['parking', 'free', 'air', 'wifi', 'pool', 'hot tub', 'close', 'garden', 'bright', 'luxury', 'renovated', 'modern', 'green', 'near', 'convenient']

df['amenities_bool'] = df.name.str.contains('|'.join(amenities_descp))
print(df)

     id                                   name  price  amenities_bool
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209           False
1   896       Renovated Alamo Square Victorian    255           False
2  5268               Mission Sunny, near Park    180            True

and instead of specific words you can do regex type matches

rooms_descp = ('\d+BR','\d+BTH')
df['rooms_bool'] = df.name.str.contains('|'.join(rooms_descp))
print(df)

     id                                   name  price  rooms_bool
0   748  Bright, Modern Garden Unit - 1BR/1BTH    209        True
1   896       Renovated Alamo Square Victorian    255       False
2  5268               Mission Sunny, near Park    180       False​​
misterhuge
  • 354
  • 1
  • 7
  • You need to add `flags=re.IGNORECASE` to do caseless matching. But I actually found it easier to do `df['name'].str.lower()` first, then you can `extract` into `amenities_descp` more easily. – wjandrea Aug 10 '23 at 21:29