I'm trying to use regular expressions to capture all Twitter handles within a tweet body. The challenge is that I'm trying to get handles that

  1. Contain a specific string
  2. Are of unknown length
  3. May be followed by either
    • punctuation
    • whitespace
    • or the end of string.

For example, for each of these strings, I've marked in brackets what I'd like to return.

"@handle what is your problem?" [RETURN '@handle']

"what is your problem @handle?" [RETURN '@handle']

"@123handle what is your problem @handle123?" [RETURN '@123handle', '@handle123']

This is what I have so far:

>>> import re
>>> re.findall(r'(@.*handle.*?)\W','hi @123handle, hello @handle123')
['@123handle']
# This misses the handles that are followed by end-of-string

I tried modifying it to use a lookahead with an alternation that also allows end-of-string. Instead, it just returns the whole string.

>>> re.findall(r'(@.*handle.*?)(?=\W|$)','hi @123handle, hello @handle123')
['@123handle, hello @handle123']
# This looks like it is too greedy and ends up returning too much

How can I write an expression that will satisfy both conditions?

I've looked at a couple other places, but am still stuck.

plfrick

2 Answers

It seems you are trying to match strings starting with @, then having 0+ word chars, then handle, and then again 0+ word chars.

Use

r'@\w*handle\w*'

or - to avoid matching @+word chars in emails:

r'\B@\w*handle\w*'

See the Regex 1 demo and the Regex 2 demo (the \B non-word boundary requires a non-word char or start of string to be right before the @).
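
As a rough illustration of what the \B adds, the sketch below runs both patterns over a made-up string that contains an email address as well as a real handle. The sample text is an assumption for demonstration only.

```python
import re

# Hypothetical sample text: one email address and one real handle.
text = 'mail joe@handlebars.com or ping @handle_joe'

# Without \B, the "@handlebars" part of the email address is picked up too.
without_boundary = re.findall(r'@\w*handle\w*', text)

# With \B, the "@" must not directly follow a word character.
with_boundary = re.findall(r'\B@\w*handle\w*', text)

print(without_boundary)  # ['@handlebars', '@handle_joe']
print(with_boundary)     # ['@handle_joe']
```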

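Note that .* is a greedy dot pattern: it matches any characters other than newline, as many as possible, so it can run across spaces and punctuation into the next handle. \w* also matches 0+ characters, as many as possible, but only from the [a-zA-Z0-9_] set if the re.UNICODE flag is not used (and it is not used in your code). In Python 3, str patterns are Unicode-aware by default, so pass re.ASCII if you need that restriction.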
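
To make the difference concrete, here is a small sketch that runs the lookahead pattern from the question and the \w* pattern against the same input string from the question.

```python
import re

text = 'hi @123handle, hello @handle123'

# Greedy .* runs across the comma and spaces to reach the last "handle".
greedy = re.findall(r'(@.*handle.*?)(?=\W|$)', text)

# \w* cannot cross non-word characters, so each handle is matched separately.
word_only = re.findall(r'@\w*handle\w*', text)

print(greedy)     # ['@123handle, hello @handle123']
print(word_only)  # ['@123handle', '@handle123']
```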

Python demo:

import re
p = re.compile(r'@\w*handle\w*')
test_str = "@handle what is your problem?\nwhat is your problem @handle?\n@123handle what is your problem @handle123?\n"
print(p.findall(test_str))
# => ['@handle', '@handle', '@123handle', '@handle123']
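
If the "specific string" varies, one possible way to reuse the pattern is to build it from a parameter. This is only a sketch: find_handles and its needle argument are hypothetical names, not part of the question, and re.escape is used so that any regex metacharacters in the search string are treated literally.

```python
import re

def find_handles(text, needle):
    """Return every @handle in text whose name contains needle (hypothetical helper)."""
    # re.escape keeps any regex metacharacters in needle literal.
    pattern = re.compile(r'\B@\w*' + re.escape(needle) + r'\w*')
    return pattern.findall(text)

print(find_handles('hi @123handle, hello @handle123', 'handle'))
# ['@123handle', '@handle123']
print(find_handles('@foo and @bar_baz!', 'bar'))
# ['@bar_baz']
```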
Wiktor Stribiżew

This matches every handle made up only of characters in the range [a-zA-Z0-9_].

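import re
# \w already covers digits and the underscore, so [\w\d_] reduces to \w
s = "@123handle what is your problem @handle123?"
print(re.findall(r'\B(@\w+)', s))
# => ['@123handle', '@handle123']
s = '@The quick brown fox@jumped over the LAAZY @_dog.'
print(re.findall(r'\B(@\w+)', s))
# => ['@The', '@_dog']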
ospahiu