I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that
- Contain a specific string
- Are of unknown length
- May be followed by punctuation, whitespace, or the end of the string
For example, for each of these strings, I've marked in brackets what I'd like to return.
"@handle what is your problem?" [RETURN '@handle']
"what is your problem @handle?" [RETURN '@handle']
"@123handle what is your problem @handle123?" [RETURN '@123handle', '@handle123']
This is what I have so far:
>>> import re
>>> re.findall(r'(@.*handle.*?)\W','hi @123handle, hello @handle123')
['@123handle']
# This misses the handles that are followed by end-of-string
I tried modifying it to include an or character (|) in a lookahead so that the end of the string is also allowed. Instead, it returns everything from the first handle to the end of the string.
>>> re.findall(r'(@.*handle.*?)(?=\W|$)','hi @123handle, hello @handle123')
['@123handle, hello @handle123']
# This looks like it is too greedy and ends up returning too much
How can I write an expression that will satisfy both conditions?
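One possible approach, sketched under the assumption that a handle consists only of word characters (letters, digits, underscore) after the @: replace each .* with \w*. Because \w cannot match whitespace or punctuation, the match stops by itself at a space, a comma, or the end of the string, so no lookahead is needed. The names HANDLE_RE and find_handles below are illustrative only, and 'handle' stands in for the specific string being searched for.

```python
import re

# Assumption: a handle is "@" followed only by word characters (\w).
# "handle" is the placeholder substring the handle must contain.
HANDLE_RE = re.compile(r'@\w*handle\w*')

def find_handles(tweet):
    """Return every @-handle in `tweet` that contains the substring 'handle'."""
    return HANDLE_RE.findall(tweet)

print(find_handles('hi @123handle, hello @handle123'))
# ['@123handle', '@handle123']
print(find_handles('what is your problem @handle?'))
# ['@handle']
```

Since . matches almost any character, including spaces and commas, the greedy .* in the original patterns can run across several handles at once; \w* cannot. If the specific string comes from user input, it would probably be safer to build the pattern with re.escape() rather than pasting it in directly.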