Using Pandas, I am reading a large CSV file and only want the rows where one column contains certain exact strings. I have this working, but I feel there should be a better way: it is very slow, and the search patterns are difficult to maintain as I add more terms. Here is the code snippet:
regex_search = r'(?:,|^)EXACT PATTERN 1(?:,|$)|(?:,|^)EXACT PATTERN 2(?:,|$)'
results = df[df['Column_to_search'].str.contains(regex_search)]
#now spit out all the rows where the column had the matched value in new CSV file
The regex I am using basically says:
(?:,|^) --> the pattern must be preceded by a comma or the start of the string
(?:,|$) --> the pattern must be followed by a comma or the end of the string
| --> OR, so I can chain as many search terms as needed
#...
#df is just the data frame that was loaded via pandas
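To make the question concrete, here is a tiny self-contained version of what I'm doing (the data and column name are made up for illustration):

```python
import pandas as pd

# Small sample standing in for the large CSV
df = pd.DataFrame({'Column_to_search': [
    'EXACT PATTERN 1,other',      # matches: starts the cell, followed by a comma
    'foo,EXACT PATTERN 2',        # matches: preceded by a comma, ends the cell
    'foo,EXACT PATTERN 1 extra',  # no match: not followed by a comma or end
]})

regex_search = r'(?:,|^)EXACT PATTERN 1(?:,|$)|(?:,|^)EXACT PATTERN 2(?:,|$)'
results = df[df['Column_to_search'].str.contains(regex_search)]
print(results)
```

Only the first two rows survive the filter; the third is rejected because the phrase is not immediately followed by a comma or the end of the cell.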
This is a maintenance headache: I have to run the list of phrases through a loop to build up the regex string, and reformat every new phrase as it is added.
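For context, the build step I'm describing looks roughly like this (a sketch; `search_terms` is a stand-in name for my phrase list):

```python
import re

search_terms = ['EXACT PATTERN 1', 'EXACT PATTERN 2']  # 300+ in practice

# Escape each phrase so regex metacharacters are treated literally,
# wrap it in the comma/boundary anchors, then join everything with OR
parts = [r'(?:,|^)' + re.escape(term) + r'(?:,|$)' for term in search_terms]
regex_search = '|'.join(parts)
```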
Originally, I had the search term as:
regex_search = 'EXACT PATTERN 1|EXACT PATTERN 2'
This is easier to maintain, but because it is a plain regex alternation it also matches a ton of false positives, especially for shorter phrases or acronyms.
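To show the false-positive problem with a short acronym (made-up data again):

```python
import pandas as pd

df = pd.DataFrame({'Column_to_search': ['ABC,foo', 'ABCDEF,foo']})

# Plain alternation: 'ABC' is also found inside 'ABCDEF'
loose = df[df['Column_to_search'].str.contains('ABC')]

# Comma/boundary anchors: only the exact field matches
strict = df[df['Column_to_search'].str.contains(r'(?:,|^)ABC(?:,|$)')]
print(len(loose), len(strict))
```

The loose pattern keeps both rows; the anchored pattern keeps only the first.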
The regex_search variable usually holds 300+ phrases, and the CSV files have thousands of rows. Is there a Python/pandas function for this? Something like:
.str.match_multiple()
#or
regex_search_list = ['ABC','XYZ','ETC']
.str.match_in_list(regex_search_list)
I don't think I can use .match because my regex string has multiple values. If there is a way for pandas to match column values against a list, I haven't found it yet.
Thoughts? Is there a better way to do this?