I need to detect @username mentions within a message, but NOT when they appear in the form @username[user_id]. I have a regex that matches the @username part, but I am struggling to reject the match when it is followed by \[\d+\].

import re

username_regex = re.compile(r'@([\w.@-]+[\w])')

usernames = username_regex.findall("Hello @kevin") # correctly finds kevin
usernames = username_regex.findall("Hello @kevin.") # correctly finds kevin
usernames = username_regex.findall("Hello @kevin[1].") # shouldn't find kevin but does

The regex allows usernames that contain @, . and -, but they must end with a \w character ([a-zA-Z0-9_]). How can I extend the regex so that it fails if the username is followed by the user id in the [1] form?

I tried @([\w.@-]+[\w])(?!\[\d+\]), but for "Hello @kevin[1]." it then matches kevi instead of matching nothing.
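A minimal reproduction of that attempt (the variable name `attempt` is only illustrative). When the lookahead fails right after `kevin`, the engine backtracks, gives back one character, and retries the lookahead after `kevi`, where the next character is `n` rather than `[`:

```python
import re

# The attempted pattern: the negative lookahead is present, but the
# quantified character class can still give characters back on backtracking.
attempt = re.compile(r'@([\w.@-]+[\w])(?!\[\d+\])')

result = attempt.findall("Hello @kevin[1].")
print(result)  # ['kevi']
```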

I'm using Python 3.10.

Wiktor Stribiżew
Kevin Renskers
  • `.compile(r'@([\w.@-]*\w)\b(?!\[\d+\])')`, add a word boundary. – Wiktor Stribiżew May 26 '21 at 11:56
  • Wow, that was fast, thanks! Can you explain why/how it works? If you put it in an answer I can accept it as well. – Kevin Renskers May 26 '21 at 11:57
  • Sadly it doesn't work correctly: `usernames = username_regex.findall("Hello @kev._in[1].")` finds `kev`, whereas it shouldn't find anything. It's because usernames can have more than just letters and numbers. – Kevin Renskers May 26 '21 at 12:04
  • Kevin, I wanted to add that the problem could have been solved if Python `re` supported possessive quantifiers or atomic groups. Unfortunately, it does not. All you would need then would be a pattern with an atomic group like `@(?>[\w.@-]*\w)(?!\[\d+])`. See [an example and explanation of how possessive quantifiers work](https://stackoverflow.com/q/51264400/3832970). – Wiktor Stribiżew May 27 '21 at 08:32
  • No problem, done. – Kevin Renskers May 28 '21 at 09:50
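The behaviour discussed in the comments above can be checked with a short sketch (the name `boundary_regex` is an assumption for illustration). The word boundary stops the match from ending in the middle of a run of word characters, but a position such as the one between `v` and `.` is still a valid boundary, so the engine can backtrack to it:

```python
import re

# Word-boundary variant from the comments: \b must hold right after the
# final \w of the username.
boundary_regex = re.compile(r'@([\w.@-]*\w)\b(?!\[\d+\])')

print(boundary_regex.findall("Hello @kevin[1]."))    # []
print(boundary_regex.findall("Hello @kev._in[1]."))  # ['kev']
```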

1 Answer

You can "emulate" possessive matching with

@(?=([\w.@-]*\w))\1(?!\[\d+\])

See the regex demo.

Details:

  • @ - a @ char
  • (?=([\w.@-]*\w)) - a positive lookahead that matches and captures into Group 1 zero or more word, ., @ and - chars, as many as possible, and then a word char immediately to the right of the current position (the text is not consumed, the regex engine index stays at the same location)
  • \1 - the text matched and captured in Group 1 (this consumes the text captured with the lookahead pattern, mind that backreferences are atomic by nature)
  • (?!\[\d+\]) - a negative lookahead that fails the match if there is [ + one or more digits + ] immediately to the right of the current location.
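As a quick check, here is a sketch that runs this pattern against the strings from the question and the comments (the `samples` list and the loop are only illustrative). Because the pattern has a single capturing group, `findall` returns the captured usernames:

```python
import re

# The lookahead captures the longest possible username into Group 1;
# \1 then consumes exactly that text, so the engine cannot shorten it
# to get past the negative lookahead.
username_regex = re.compile(r'@(?=([\w.@-]*\w))\1(?!\[\d+\])')

samples = [
    "Hello @kevin",        # ['kevin']
    "Hello @kevin.",       # ['kevin']
    "Hello @kevin[1].",    # []
    "Hello @kev._in[1].",  # []
]

for text in samples:
    print(repr(text), username_regex.findall(text))
```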
Wiktor Stribiżew