
I want to check whether a substring exists in a large text. So far I was simply using:

if pattern in text: ...

However, I want to make sure that the occurrence of "pattern" in "text" is not immediately preceded or followed by a letter. It is fine if it is preceded or followed by special characters, digits or whitespace.

So, if pattern is "abc", a search on "some text abc" or "random texts, abc, cde" should return True, while a search on "some textabc" or "random abctexts" should return False (because "abc" is preceded or followed by a letter).

What is the best way to perform this operation?

vish4071
`r'(?:[^a-zA-Z])(abc)(?:[^a-zA-Z])'` will capture only `abc`. `(?: ...)` indicates a _non-capturing group_, so the non-alphabetic characters are not captured. You can check this [community guide on regex](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) and feel free to experiment with tools like [regex101](https://regex101.com/r/LVAwxP/1) – Ignatius Reilly Oct 11 '22 at 17:19

1 Answer


How about this:

import re

string = "random texts, abc, cde"

match = re.search(r'(^|[^a-zA-Z])abc([^a-zA-Z]|$)', string)
# search() returns a match object on success, or None if nothing matched
if match:
    # group() also contains the neighbouring non-letter characters, here ' abc,'
    print('found', match.group())
else:
    print('did not find')

"(^|[^a-zA-Z])" means: beginning of the string OR any non-alphabetic character; "([^a-zA-Z]|$)" does the same for the end of the string.
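If the pattern is not hard-coded, the same regex can be wrapped in a small helper. This is only a sketch: the function name contains_standalone is made up for illustration, and re.escape is added on the assumption that the pattern should be treated as literal text rather than as a regex. It is run against the four example strings from the question.

```python
import re

def contains_standalone(pattern, text):
    """Return True if `pattern` occurs in `text` with no ASCII letter
    directly before or after it (hypothetical helper, not from the question)."""
    # re.escape keeps characters such as "." or "+" in `pattern` literal.
    regex = r'(^|[^a-zA-Z])' + re.escape(pattern) + r'([^a-zA-Z]|$)'
    return re.search(regex, text) is not None

samples = [
    "some text abc",           # expected True
    "random texts, abc, cde",  # expected True
    "some textabc",            # expected False
    "random abctexts",         # expected False
]
for sample in samples:
    print(repr(sample), contains_standalone("abc", sample))
```

For a very large text that is searched repeatedly, compiling the regex once with re.compile and reusing it may be worth considering.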

To explain a bit more: "|" means OR, so (^|d) means "beginning of the string or a d". The parentheses define which alternatives the OR operator applies to. You wanted your abc string not to be directly surrounded by any alphabetic character. If you broaden this a little, so that 0-9 and the underscore are forbidden as well (which is stricter than what the question asks for), you get a simpler regex: r'(^|\W)abc(\W|$)'
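To see where the two variants differ, here is a short comparison. The names strict and broad and the sample strings are mine, chosen only for illustration. The difference shows up when a digit or an underscore sits right next to abc.

```python
import re

# Only letters are forbidden as neighbours of "abc".
strict = re.compile(r'(^|[^a-zA-Z])abc([^a-zA-Z]|$)')
# Letters, digits and the underscore are forbidden as neighbours of "abc".
broad = re.compile(r'(^|\W)abc(\W|$)')

for sample in ["abc", "1abc2", "_abc_", "x abc.", "xabc"]:
    print(repr(sample), bool(strict.search(sample)), bool(broad.search(sample)))

# 'abc'    True  True
# '1abc2'  True  False
# '_abc_'  True  False
# 'x abc.' True  True
# 'xabc'   False False
```

Since the question says digits next to the pattern are fine, the letter-only character class is probably the closer fit. The \W version is mainly shorter to write.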

Erik-Jan O