In the Perl era I was a regex freak, and I definitely struggle adapting to re. To simplify a big data set I needed to search for a "|" character, and the only combination that would work was re.escape('|') with re.search instead of re.match.

import re

# Escape the pipe so it is matched literally rather than as alternation
x = re.compile(re.escape('|'))
cohort = ['virus_1', 'virus_2|virus_3']

for isolate in cohort:
    # note that re.escape(isolate) fails
    if x.search(isolate):
        print(isolate)

OUTPUT

virus_2|virus_3

Okay, the above combination works, but re.match doesn't. Also, why do I need re.escape('|'), and why does re.escape(isolate), i.e. the list element, fail? What am I missing to routinely use re?

M__
  • This should not even be valid Python. Are you sure you typed it like this? Python *does not have* special syntax for regex; they are dealt with through function and method calls. – jsbueno May 13 '20 at 14:04
  • [What is the difference between re.search and re.match?](https://stackoverflow.com/questions/180986/what-is-the-difference-between-re-search-and-re-match) TL;DR: `re.match` is looking for matches **from the start of the string** – Tomerikoo May 13 '20 at 14:10
  • Recall that `re.match` always matches from the beginning of the string. You can fix your regex here with `x = re.compile(r'.*\|')`. You would need to escape the `|` alternation metacharacter in Perl too to match a literal `'|'` in a string, by the way. – dawg May 13 '20 at 14:10
  • You don't have to use `re.escape`, you can just escape: `re.compile(r'\|')` – Tomerikoo May 13 '20 at 14:12
  • Okay, thanks everyone, I get it. That was much easier than I thought. Regarding Pythonic comprehensions etc., it's discipline-specific. – M__ May 13 '20 at 14:37

1 Answer

So, there are two things that likely differ from Perl. First, re.match in Python has to match at the beginning of the string; that is, you have to write a regexp that matches from the start of the string onward. To find a pattern anywhere in the string, use re.search or re.findall.
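As a rough sketch of that difference, using the cohort list from the question (the names `pipe`, `results` and `all_pipes` are just illustrative, and the pattern is built with re.escape as in the question):

```python
import re

cohort = ['virus_1', 'virus_2|virus_3']
pipe = re.compile(re.escape('|'))  # pattern for a literal pipe character

results = []
for isolate in cohort:
    anchored = bool(pipe.match(isolate))   # only tries at position 0
    anywhere = bool(pipe.search(isolate))  # scans the whole string
    results.append((isolate, anchored, anywhere))

# findall returns every non-overlapping match as a list of strings
all_pipes = pipe.findall('a|b|c')

for row in results:
    print(row)
print(all_pipes)
```

Here match is False for both elements because neither string starts with a pipe, while search is True for the second one.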

The other thing has to do with escaping. The Python parser itself uses the \ character in ordinary string literals to denote special control characters (such as \n), before the re module ever sees the pattern, so backslashes in plain strings passed to re calls are likely to cause problems. Python therefore has a special form of string literal, where the opening quote is prefixed with an r, like r"regexp_here". In such a literal the parser does not touch the \ character and creates a string object that contains the literal \ character. This string is suitable to be passed as an argument to the various re functions. Then you just have to escape the | with a \ as usual, inside an r-prefixed string:

In [164]: cohort = ['virus_1', 'virus_2|virus_3']

In [165]: [string for string in cohort if re.search(r"\|", string)]
Out[165]: ['virus_2|virus_3']

In [166]: [string for string in cohort if re.match(r"^.*?\|", string)]
Out[166]: ['virus_2|virus_3']

In [167]: [string for string in cohort if re.match(r"\|", string)]
Out[167]: []
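
On the re.escape part of the question, here is a minimal sketch of what is probably going on. I am assuming the "failure" came from applying re.escape to the text instead of the pattern, since the question does not show that code; the variable names are only illustrative.

```python
import re

cohort = ['virus_1', 'virus_2|virus_3']

# Unescaped, '|' means "empty pattern OR empty pattern", which matches
# the empty string at the start of any input, so nothing is filtered out.
unescaped_hits = [s for s in cohort if re.search('|', s)]

# re.escape builds the escaped pattern text; for a single pipe it is
# the same two-character string as the hand-written raw literal.
same_pattern = re.escape('|') == r'\|'
raw_length = len(r'\|')

# re.escape belongs on the pattern side. Applied to the text being
# searched, it only returns a modified copy of that text.
escaped_text = re.escape('virus_2|virus_3')

print(unescaped_hits)
print(same_pattern, raw_length)
print(escaped_text)
```

So re.escape('|') and r'\|' are interchangeable here, and re.escape is only worth the extra call when the literal text to search for comes from a variable or from user input.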
jsbueno