
I am learning regular expressions from a website, and I am having trouble understanding how metacharacters and backslashes are used in raw strings.

import re

pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
    print("Match 1")

match = re.match(pattern, "?! ?!")
if match:
    print("Match 2")

match = re.match(pattern, "abc cde")
if match:
    print("Match 3")

My main doubt is about the use of (.+) and the backslash here. What would the output be if it were \2 instead of \1? I know + means "one or more repetitions".

H. Garg

1 Answer


When you write this:

r"(.+) \1"

the \1 must match exactly the text that the first group captured. The pattern does not match "abc cde" because once the first group has captured abc, the rest of the pattern behaves like re.match(r'abc abc', text), and cde is not abc. This is called a back-reference to a group.
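As a rough sketch of the point above, the snippet below runs the three strings from the question and also tries \2 in place of \1. The helper name try_match is made up for illustration; everything else is the standard re module.

```python
import re

pattern = r"(.+) \1"

def try_match(text):
    """Return the text captured by group 1, or None if the pattern does not match."""
    m = re.match(pattern, text)
    return m.group(1) if m else None

print(try_match("word word"))  # group 1 is "word", and \1 matches the second "word"
print(try_match("?! ?!"))      # . matches any character, so "?!" works as well
print(try_match("abc cde"))    # None: "cde" is not a repeat of "abc"

# \2 refers to a second group, but this pattern has only one,
# so the pattern is rejected when it is compiled.
try:
    re.compile(r"(.+) \2")
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False
```

So with \2 none of the "Match" lines would print; the script would stop with a re.error on the first re.match call.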

For example, suppose you need to match text that starts and ends with the same letter:

import re

pattern = r"(\w).+\1"

match = re.match(pattern, "ABA")  # OK
match = re.match(pattern, "ABC")  # NO

Another example: match text that starts with three letters and ends with the same letters in reverse order.

import re

pattern = r"(\w)(\w)(\w)\3\2\1"
re.match(pattern, 'ABCCBA') # OK
re.match(pattern, 'ABCCBC') # NO

Note: you can back-reference only a capturing group. This means (?:.+) \1 is not valid: the non-capturing group matches text but does not capture it, so there is nothing for \1 to refer to.
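A small sketch of that note, assuming Python's standard re module. The variable names are only for illustration.

```python
import re

# A capturing group gets the number 1, so \1 can refer to it.
capturing = re.match(r"(.+) \1", "word word")
print(capturing is not None)  # True

# A non-capturing group (?:...) gets no group number,
# so \1 has nothing to refer to and compiling the pattern fails.
try:
    re.compile(r"(?:.+) \1")
    non_capturing_ok = True
except re.error:
    non_capturing_ok = False
print(non_capturing_ok)  # False
```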

Edits

  • + matches one or more times, so it requires at least one occurrence
  • * matches zero or more times

ca+t matches cat, caat, caaat: c followed by one or more a, followed by t.

ca*t matches ct, cat, caaaaat: c followed by zero or more a, followed by t.
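To see the difference between the two quantifiers side by side, here is a short sketch. The list of candidate strings is just an example; re.fullmatch is used so the whole string has to fit the pattern.

```python
import re

candidates = ["ct", "cat", "caat", "caaaaat"]

# + needs at least one "a"; * also accepts none.
plus_matches = [s for s in candidates if re.fullmatch(r"ca+t", s)]
star_matches = [s for s in candidates if re.fullmatch(r"ca*t", s)]

print(plus_matches)  # ['cat', 'caat', 'caaaaat']
print(star_matches)  # ['ct', 'cat', 'caat', 'caaaaat']
```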

Charif DZ