
I'm looking to find words in a string that match a specific pattern. Problem is, if the words are part of an email address, they should be ignored.

To simplify, the pattern of the "proper words" is \w+\.\w+ - one or more word characters, a literal period, and another run of word characters.

A sentence that causes the problem, for example, is a.a b.b:c.c d.d@e.e.e.

The goal is to match only [a.a, b.b, c.c]. With most regexes I build, e.e is returned as well (because I use some word boundary match).

For example:

>>> re.findall(r"(?:^|\s|\W)(?<!@)(\w+\.\w+)(?!@)\b", "a.a b.b:c.c d.d@e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']

How can I match only among words that do not contain "@"?

alon
  • instead of trying to get a clever regex going, perhaps clean up the string first? first strip \w+@\w+ then process. I do a lot of ETL work with python and often it's just easier/faster to clean up trash, then split/process the data. – sniperd Aug 01 '17 at 15:16
  • http://www.rexegg.com/regex-best-trick.html#thetrick – bobble bubble Aug 01 '17 at 17:34

3 Answers


I would definitely clean it up first and simplify the regex.

First, split the string on colons and whitespace:

import re
words = re.split(r':|\s', "a.a b.b:c.c d.d@e.e.e")

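Then filter out the words that have an @ in them. Note that re.search returns a match object (or None), so it belongs in the condition of the comprehension rather than in its output:

words = [word for word in words if re.search(r'^((?!@).)*$', word)]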
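
The split-then-filter idea above stops short of applying the \w+\.\w+ pattern itself. A minimal sketch of the whole pipeline might look like this; the function name dotted_words is hypothetical, and it assumes colons and whitespace are the only separators, as in the sample string:

```python
import re

def dotted_words(text):
    """Return tokens shaped like word.word, skipping tokens that contain '@'."""
    # Split on runs of colons and whitespace (assumed to be the only separators).
    tokens = re.split(r"[:\s]+", text)
    # Keep a token only if it has no '@' and is entirely of the form \w+\.\w+.
    return [
        token for token in tokens
        if "@" not in token and re.fullmatch(r"\w+\.\w+", token)
    ]

print(dotted_words("a.a b.b:c.c d.d@e.e.e"))  # ['a.a', 'b.b', 'c.c']
```

The explicit "@" check is redundant here, since re.fullmatch already rejects any token containing @, but it keeps the intent visible.
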
Cory Madden

Properly parsing email addresses with a regex is extremely hard, but for your simplified case, with a word defined as \w+\.\w+ and an email as any sequence that contains @, this regex might do what you need:

>>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d@e.e.e")
['a.a', 'b.b', 'c.c']

The trick here is not to focus on what comes in the next or previous word, but on what the word currently captured has to look like.

Another trick is in properly defining word separators. Before the word we allow the start of the string or a run of whitespace and colons, consuming those characters without capturing them. After the word we require almost the same (the end of the string instead of the start), but we do not consume those characters; we use a lookahead assertion instead.
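
To see why the trailing separator is checked with a lookahead rather than consumed, here is a small comparison. The second pattern is a hypothetical variant written only for contrast; it is not part of the answer above:

```python
import re

text = "a.a b.b:c.c d.d@e.e.e"

# Trailing separator checked with a lookahead: it is left in place, so it can
# also serve as the leading separator of the next word.
lookahead = re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", text)

# Hypothetical variant that consumes the trailing separator instead.
consuming = re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?:[:\s]+|$)", text)

print(lookahead)  # ['a.a', 'b.b', 'c.c']
print(consuming)  # ['a.a', 'c.c']
```

In the consuming variant, the space after a.a is used up by the first match, so b.b no longer has a separator in front of it and is skipped.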

randomir

You may match the email-like substrings with \S+@\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to only return captured values and filter out empty items (they will be in re.findall results when there is an email match):

import re
rx = r"\S+@\S+\.\S+|(\w+\.\w+)"
s = "a.a b.b:c.c d.d@e.e.e"
res = list(filter(None, re.findall(rx, s)))  # list() is needed in Python 3, where filter returns an iterator
print(res)
# => ['a.a', 'b.b', 'c.c']
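
The same match-and-discard idea can also be sketched with re.finditer, which avoids the empty strings altogether. The names PATTERN and find_dotted_words are hypothetical, chosen only for this illustration:

```python
import re

# First alternative: an email-like token, matched but not captured.
# Second alternative: the wanted word, captured in group 1.
PATTERN = re.compile(r"\S+@\S+\.\S+|(\w+\.\w+)")

def find_dotted_words(text):
    """Return every \\w+\\.\\w+ word that is not part of an email-like token."""
    # group(1) is None whenever the email alternative matched, so skip those.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

print(find_dotted_words("a.a b.b:c.c d.d@e.e.e"))  # ['a.a', 'b.b', 'c.c']
```

One caveat of this approach: because \S+ spans colons, a dotted word glued to an email by a colon, as in a.a:d@e.f, is skipped along with the email.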


Wiktor Stribiżew