I want to find all occurrences of a given phrase in a passage. The phrases are user inputs and cannot be predicted beforehand.

One solution is to use regex (findall, finditer) to search for the phrase in the passage:

import re

phrase = "24C"
passage = "24C with"

inds = [m.start() for m in re.finditer(phrase, passage)]

Then the result is

inds = [0]

because the phrase matches the passage at index 0 and there is only one occurrence.

However, when the phrase contains characters that have special meanings in regex, things get trickier:

import re

phrase = "24C (75F)"
passage = "24C (75F) with"

inds = [m.start() for m in re.finditer(phrase, passage)]

Then the result is

inds = []

This is because the parentheses are interpreted as regex grouping syntax rather than as literal characters, which is not what I want: I only need literal matches.
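To make the failure concrete, here is a minimal sketch of what the unescaped pattern actually matches. The second passage without parentheses is made up for illustration, and the variable names are hypothetical.

```python
import re

phrase = "24C (75F)"

# Unescaped, "(75F)" is a capturing group that matches the text "75F",
# so the whole pattern matches "24C 75F" (no literal parentheses).
hits_without_parens = [m.start() for m in re.finditer(phrase, "24C 75F with")]
hits_with_parens = [m.start() for m in re.finditer(phrase, "24C (75F) with")]

print(hits_without_parens)  # [0]
print(hits_with_parens)     # []
```

So the pattern does match something, just not the text that was typed in.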

Is there any way to force the phrase to be treated as a string literal, not a regex pattern?

Yo Hsiao
  • Why are you using regex for this instead of `.find()`? – ctwheels Sep 28 '17 at 19:37
  • Because I need to find "all" occurrences. – Yo Hsiao Sep 28 '17 at 19:41
  • You could always use a while loop and iterate from the last matched position + 1 like in this post: https://codereview.stackexchange.com/questions/146834/function-to-find-all-occurrences-of-substring – ctwheels Sep 28 '17 at 19:46
  • Thought of that but it comes with a catch: we need to handle the word boundaries ourselves. If regex can do it with one line, being bug-free and readable, I think it is a better idea to leverage existing libraries. Thank you for your input! – Yo Hsiao Sep 28 '17 at 19:57
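For reference, the loop-based alternative discussed in these comments might look roughly like the sketch below. The helper name `find_all` is hypothetical, and the code is not taken from the linked post. It does no word-boundary handling, which is the catch mentioned above.

```python
def find_all(passage, phrase):
    """Return the start index of every literal occurrence of phrase in passage."""
    if not phrase:
        return []  # an empty phrase would otherwise match at every position
    indices = []
    start = passage.find(phrase)
    while start != -1:
        indices.append(start)
        # Resume one character after the last hit, so overlapping hits are found too.
        start = passage.find(phrase, start + 1)
    return indices

print(find_all("24C (75F) with 24C (75F)", "24C (75F)"))  # [0, 15]
```

Note that stepping by one character reports overlapping occurrences (for example, `find_all("aaa", "aa")` gives `[0, 1]`), whereas `re.finditer` only yields non-overlapping matches.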

1 Answer

You can use re.escape() to force regex to treat the string as literal:

import re
phrase = "24C (75F)"
passage = "24C (75F) with"
inds = [m.start() for m in re.finditer(re.escape(phrase), passage)]
print(inds)

Output:

[0]
Ajax1234
  • Excellent! Exactly what I was looking for. Just to add some info: the official doc says "Escape all the characters in pattern except ASCII letters, numbers and '_'." But in effect, Unicode characters are treated literally without an issue. Sweet! – Yo Hsiao Sep 28 '17 at 19:43
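A small sketch that checks the Unicode observation and also picks up the word-boundary concern from the question comments. The helper `find_literal`, its `whole_words` flag, and the sample passage are all hypothetical. Treating "whole word" as "not adjacent to a `\w` character" is an assumption about what is wanted.

```python
import re

def find_literal(phrase, passage, whole_words=False):
    """Return start indices of literal, non-overlapping matches of phrase."""
    pattern = re.escape(phrase)
    if whole_words:
        # Lookarounds instead of \b: a trailing \b after ")" would only match
        # if a word character followed, so it fails before a space.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return [m.start() for m in re.finditer(pattern, passage)]

passage = "24°C (75°F) today, 124°C (75°F) in the oven"
print(find_literal("24°C (75°F)", passage))                    # [0, 20]
print(find_literal("24°C (75°F)", passage, whole_words=True))  # [0]
```

The non-ASCII `°` is matched literally either way, and with `whole_words=True` the occurrence inside `124°C (75°F)` is skipped because it is preceded by a digit.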