
Using pandas, I am reading a large CSV file and only want the rows that contain certain exact strings in one column. I have this working, but I feel there should be a better way: it is very slow, and the search patterns are difficult to maintain as I add more terms. Here is the code snippet:

regex_search = '(?:\,|^)EXACT PATTERN 1(?:\,|$)|(?:\,|^)EXACT PATTERN 2(?:\,|$)'
results = df[df['Column_to_search'].str.contains(regex_search)]
#now spit out all the rows where the column had the matched value into a new CSV file

The regex I am using basically says:

(?:\,|^)  --> the pattern must be preceded by a comma (or the start of the string)
(?:\,|$)  --> the pattern must be followed by a comma (or the end of the string)
|         --> OR, so that I can chain as many search terms as needed
#...
#df is just the DataFrame that was loaded via pandas

This pattern is a pain to maintain! I have to run my list of phrases through a loop to build up the regex string, and then format every new phrase as it gets added.
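For reference, here is a rough sketch of what that loop looks like today (the file name and phrase list are just placeholders):

import re
import pandas as pd

df = pd.read_csv('large_file.csv')  # placeholder file name

phrases = ['EXACT PATTERN 1', 'EXACT PATTERN 2']  # 300+ phrases in practice

# wrap each phrase in the comma/start and comma/end groups, then join with |
regex_search = '|'.join('(?:,|^)' + re.escape(p) + '(?:,|$)' for p in phrases)

results = df[df['Column_to_search'].str.contains(regex_search)]
results.to_csv('matched_rows.csv', index=False)  # write the matched rows to a new CSV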

Originally, I had the search term as:

regex_search = 'EXACT PATTERN 1|EXACT PATTERN 2'

This is easier to maintain, but because it is a regex it will also match substrings, which produces a ton of false positives, especially for smaller phrases or acronyms.
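For example (made-up values), a short acronym matches anything that merely contains it:

import pandas as pd

col = pd.Series(['ABC', 'ABCDEF', 'XABCY'])  # made-up values
print(col.str.contains('ABC|XYZ'))           # True for all three rows, not just the exact 'ABC'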

The regex_search variable usually has 300+ phrases to search, and the CSV files have thousands of rows. Is there a Python function to do this? Maybe something like:

.str.match_multiple()
#or
regex_search_list = ['ABC','XYZ','ETC']
.str.match_in_list(regex_search_list)

I don't think I can use .match because my regex string has multiple values. If there is a way for pandas to match column values based upon a list, I haven't found it yet.

Thoughts? Is there a better way to do this?

jleatham
  • Are you asking if pandas can search a column for a list of values? `df[df.column_name.isin(list_of_values)]` It has to be an exact match. – A.Kot Sep 13 '17 at 16:33
  • From what I have tested, the `str` accessor isn't faster than using a for loop in Python. That might give you a small speed up. Then I would suggest checking out [cython](http://cython.org/). – Ted Petrou Sep 13 '17 at 16:39
  • The problem with this alternation based regex is that each alternative matches at the same location. Use `(?:,|^)(?:patt1|patt2|pattN)(?:,|$)`. Also, the alternation in the beginning and end can be made up for using `(?<![^,])` and `(?![^,])` (not sure if it is faster). Then, you may use [trie to build an efficient regex.](https://stackoverflow.com/a/42789508/3832970) – Wiktor Stribiżew Sep 13 '17 at 17:20
  • A.Kot: wow, .isin() is sooo much faster! And it is much easier to manage as everything is in a list. Wiktor: I haven't tried your solution yet, and might not need to because .isin works so well, but good call, I think you are right that this would be way better than my regex. – jleatham Sep 13 '17 at 18:59

1 Answer


Thanks to A.Kot in the comments, I used .isin() and it took my script from 20 minutes to about 10 seconds.

New code:

list_search = ['EXACT PATTERN 1', 'EXACT PATTERN 2']
results = df[df['Column_to_search'].isin(list_search)]
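
And to finish the original goal of writing the matched rows out to a new CSV (the output file name here is just a placeholder):

results.to_csv('matched_rows.csv', index=False)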
jleatham