I recently asked the question linked below, and the accepted answer was perfect for what I needed at the time. However, my search list has since evolved to include parentheses. I still want to find every instance of the words in my search list and keep count.
Python-searching data frame for words in a list and keep track of words found AND frequency
I know that parentheses have special meaning in regex, so I made the modification below to escape them (in addition to adding ?: to make the group non-capturing). However, the output is still off: it inserts '' wherever there are parentheses. I can't post my data, but I have created an example below.
import re
search_list = ['STEEL','STEEL (ST)','(ST)','IRON','GOLD','GD','(GD)','SILVER']
df['c'] = df.b.str.findall('(?:\({0}\))'.format('|'.join(search_list)), flags=re.IGNORECASE)
df['d'] = df['c'].str.len()
     a  b                           c                          d
0  123  'Blah Blah Steel'           ['STEEL']                  1
1  789  'Blah Blah Steel Gold'      ['STEEL','GOLD']           2
2  789  'Blah Blah (ST)'            [('ST', '')]               1
3  790  'Blah Blah (ST) blah (GD)'  [('ST', ''), ('', 'GD')]   2
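As far as I can tell, the tuples with empty strings in the output above are what re.findall returns when the pattern contains capturing groups: each unescaped (ST) or (GD) in the joined search list is treated as a group rather than as literal parentheses. Here is a minimal sketch with plain re that shows the difference. The sample string and both patterns are simplified stand-ins for illustration, not my real data or my exact pattern.

```python
import re

# Simplified stand-in for one row of column b
text = "Blah Blah (ST) blah (GD)"

# Unescaped parentheses act as capturing groups, so findall returns
# one tuple per match, with one slot per group in the pattern.
unescaped = re.findall(r"(ST)|(GD)", text, flags=re.IGNORECASE)

# Escaped parentheses are matched literally. With no capturing groups,
# findall returns the full matched strings instead of tuples.
escaped = re.findall(r"\(ST\)|\(GD\)", text, flags=re.IGNORECASE)

print(unescaped)
print(escaped)
```

The first call produces tuples padded with empty strings, which looks like the output in my table; the second returns the matched text including the parentheses.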
I have tried or looked into the following:
- Various other methods of escaping characters, but all lead to a similar output as above (solutions noted below)
- re.escape (I am still not clear how to use this together with re.findall())
- re.search (I have only seen this used to search within a list rather than searching for a list within a dataframe)
- Searched alternatives to a regex solution, but have yet to come across any
Some of the other solutions I have referenced are:
- Python regex: matching a parenthesis within parenthesis
- How to search a string with parentheses using regular expression?
- Get the string within brackets in Python
The above output is manageable, but not ideal, and I am worried it might not actually be working the way I think it is, which could cause errors further down the road. I am very new to Python and have always struggled with regular expressions. Any information would be much appreciated!
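For reference, this is roughly how I imagine re.escape could fit in: escape each search term individually, then join the escaped terms with |. It is only a sketch under my own assumptions. The names ordered, pattern, rows, found, and counts are made up for illustration, a plain list stands in for df.b, and sorting the terms longest first is my guess at how to stop 'STEEL' from matching before 'STEEL (ST)'.

```python
import re

search_list = ['STEEL', 'STEEL (ST)', '(ST)', 'IRON', 'GOLD', 'GD', '(GD)', 'SILVER']

# Assumption: try longer terms first so 'STEEL (ST)' wins over 'STEEL'
ordered = sorted(search_list, key=len, reverse=True)

# re.escape turns '(' and ')' into literal characters, so the joined
# pattern contains no capturing groups at all
pattern = '|'.join(re.escape(term) for term in ordered)

# Plain list standing in for the strings in column b
rows = [
    'Blah Blah Steel',
    'Blah Blah Steel Gold',
    'Blah Blah (ST)',
    'Blah Blah (ST) blah (GD)',
]

# One list of matches per row, plus the number of matches per row
found = [re.findall(pattern, row, flags=re.IGNORECASE) for row in rows]
counts = [len(matches) for matches in found]

# With pandas, the same pattern should be usable as:
# df['c'] = df.b.str.findall(pattern, flags=re.IGNORECASE)
# df['d'] = df['c'].str.len()
```

Note that findall returns the text as it appears in the row (for example 'Steel'), not the uppercase term from the search list, so the matches would need normalizing with .upper() if the counts are to be grouped by search term.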