Python-searching data frame for words in a list with special characters. Output not as expected

Question

I recently asked the below question and the accepted answer was perfect for what I needed at the time, however, now my search list has evolved to include parentheses. I still want to find every instance of the words in my search list and keep count.

Python-searching data frame for words in a list and keep track of words found AND frequency

I know that parentheses have special meaning in regex, so I made the modification below to escape them (in addition to adding ?: to turn the capturing group into a non-capturing one). However, the output is still off: it inserts '' wherever there are parentheses. I can't post my data, but I have created an example below.

import re

search_list = ['STEEL','STEEL (ST)','(ST)','IRON','GOLD','GD','(GD)','SILVER']

df['c'] = df.b.str.findall('(?:\({0}\))'.format('|'.join(search_list)), flags=re.IGNORECASE)
df['d'] = df['c'].str.len()

     a     b                            c                          d
0    123   'Blah Blah Steel'            ['STEEL']                  1
1    789   'Blah Blah Steel Gold'       ['STEEL','GOLD']           2
2    789   'Blah Blah (ST)'             [('ST', '')]               1
3    790   'Blah Blah (ST) blah (GD)'   [('ST', ''), ('', 'GD')]   2
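The tuples in column c are what the standard-library re.findall returns when a pattern contains more than one capturing group. A minimal sketch, assuming the same search list and pattern as in the question and applying them to a single string instead of the data frame: the unescaped "(ST)" and "(GD)" entries become capturing groups once the list is joined, so each match is reported as a tuple of group contents, with '' for groups that did not take part.

```python
import re

search_list = ['STEEL', 'STEEL (ST)', '(ST)', 'IRON', 'GOLD', 'GD', '(GD)', 'SILVER']

# Same pattern as in the question, written as a raw string.
# Only the very first "(" and the very last ")" end up escaped;
# the parentheses inside the list entries remain capturing groups.
pattern = r'(?:\({0}\))'.format('|'.join(search_list))

# Three capturing groups survive: "(ST)" twice and "(GD)" once.
group_count = re.compile(pattern).groups

text = 'Blah Blah (ST) blah (GD)'
raw_matches = re.findall(pattern, text, flags=re.IGNORECASE)

print(group_count)   # 3
print(raw_matches)   # [('', 'ST', ''), ('', '', '')]
```

In this sketch the second tuple is entirely empty because the plain GD alternative matches before the (GD) group is tried, so no group captures anything.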

I have tried or looked into the following:

Various other methods of escaping characters, but all lead to a similar output as above (solutions noted below)
re.escape (I am still not clear how to use this together with re.findall())
re.search (I have only seen this used to search within a list rather than searching for a list within a dataframe)
Searched alternatives to a regex solution, but have yet to come across any
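On the re.escape item above, one way to combine it with re.findall is to escape each search term before joining the terms with "|". This is a minimal sketch on a plain string, not the data frame; the variable names are illustrative only.

```python
import re

search_list = ['STEEL', 'STEEL (ST)', '(ST)', 'IRON', 'GOLD', 'GD', '(GD)', 'SILVER']

# re.escape puts a backslash in front of regex metacharacters,
# so "(ST)" is matched literally instead of forming a group.
escaped_terms = [re.escape(term) for term in search_list]
pattern = '|'.join(escaped_terms)

matches = re.findall(pattern, 'Blah Blah (ST) blah (GD)', flags=re.IGNORECASE)

print(escaped_terms[2])  # \(ST\)
print(matches)           # ['(ST)', '(GD)']
```

Because the escaped pattern has no capturing groups, findall returns plain strings rather than tuples.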

Some of the other solutions I have referenced are: "Python regex: matching a parenthesis within parenthesis"; "How to search a string with parentheses using regular expression?"; "Get the string within brackets in Python".

The above output is manageable but not ideal, and I am worried it might not actually be working the way I think, which could cause errors down the road. I am very new to Python and have always struggled with regular expressions. Any information would be much appreciated!

Please check the threads I linked to on top of the question. Let me know if it helps. — Wiktor Stribiżew, Oct 19 '20 at 20:42
Strange that `df['d'] = df['c'].str.len()` gets the number of elements in the array. I guess str can mean array? — , Oct 19 '20 at 21:44
What your regex matches is this: https://regex101.com/r/x8aMsn/1. The regex itself appears to show a very strange alternation list, and it is hard to glean exactly what you're trying to match from your made-up output. If you take each alternation by itself and plug it into that join statement, you will see what I mean. — , Oct 19 '20 at 22:15
Here is the string Python's findall() is getting from your code: `>>> search_list = ['STEEL','STEEL (ST)','(ST)','IRON','GOLD','GD','(GD)','SILVER']` `>>> print('(?:\({0}\))'.format('|'.join(search_list)))` `(?:\(STEEL|STEEL (ST)|(ST)|IRON|GOLD|GD|(GD)|SILVER\))` I realize it's case-insensitive, but that's not the issue. — , Oct 19 '20 at 22:16
I referenced the solutions provided by @Wiktor Stribiżew and changed my code to: `df['c'] = df.b.str.findall('(?<!\w)(?:{0})(?!\w)'.format('|'.join(map(re.escape,search_list))), flags=re.IGNORECASE)`. I added the re.escape earlier when I was researching other solutions. I am not sure if it is still necessary. — CJJ, Oct 20 '20 at 14:20
So, does it work as expected now? If not, please update the question and let me know via a comment containing `@`+username. — Wiktor Stribiżew, Oct 20 '20 at 14:28
@WiktorStribiżew, it is working as expected now. Thank you for pointing me in the right direction. — CJJ, Oct 20 '20 at 14:53
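For reference, the code CJJ describes in the comments can be put together as a runnable sketch. The data frame here is rebuilt from the example rows in the question, so its contents are an assumption rather than the asker's real data. The pattern is written as a raw string to avoid invalid-escape warnings on newer Python versions.

```python
import re
import pandas as pd

search_list = ['STEEL', 'STEEL (ST)', '(ST)', 'IRON', 'GOLD', 'GD', '(GD)', 'SILVER']

# Example data rebuilt from the question (assumed, not the real data).
df = pd.DataFrame({
    'a': [123, 789, 789, 790],
    'b': ['Blah Blah Steel', 'Blah Blah Steel Gold',
          'Blah Blah (ST)', 'Blah Blah (ST) blah (GD)'],
})

# Escape every term, then require that no word character touches
# either side of the match.
pattern = r'(?<!\w)(?:{0})(?!\w)'.format('|'.join(map(re.escape, search_list)))

df['c'] = df['b'].str.findall(pattern, flags=re.IGNORECASE)
# .str.len() also works on list elements, giving the number of matches.
df['d'] = df['c'].str.len()

print(df)
```

Column c holds the text as it appears in column b (for example 'Steel', not 'STEEL'), since findall returns the matched substring rather than the search term.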
