
I'm trying to parse every first and last name from a string in the Outlook "To" convention and save each one in a Python list. I'm using Python 3.6.4.
For example, I would like the following string:

"To: John Lennon <John.Lennon@gmail.com> \b002; Paul McCartney <Paul.McCartney@yahoo.com> \b002;"

to be parsed into:

['John Lennon','Paul McCartney']

I used "Replace all words from word list with another string in python" as a reference and came up with this code:

import re
prohibitedWords = [r'to:',r'To:','\b002',"\<(.*?)\>"]
mystring = 'To: John Lennon <John.Lennon@gmail.com> \b002; Paul McCartney <Paul.McCartney@yahoo.com> \b002;'
big_regex = re.compile('|'.join(prohibitedWords))
the_message = big_regex.sub("", str(mystring)).strip()
print(the_message)

However, I'm getting the following results:

John Lennon  ; Paul McCartney  ;

This is not optimal as I'm getting lots of spaces which I cannot parse. In addition, I have a feeling this is not the optimal approach for this. Appreciate any advice.
Thanks

Asaf Lahav

1 Answer


Using re.sub with an alternation built from these parts [r'to:',r'To:','\b002',"\<(.*?)\>"] replaces every match with an empty string.

If you remove all the unwanted characters that way, you end up with the string John Lennon Paul McCartney (as in this Python example), and you can no longer tell which part belongs to which name if you then want to split it.

Removing the surrounding whitespace characters as well might lead to unexpected gaps or concatenated words.
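To illustrate the point about losing the boundaries, here is a minimal sketch. It assumes the alternation from the question is extended to also remove the trailing `;`; the names `remove_all` and `flattened` are made up for this example.

```python
import re

# Same input as in the question; "\b" in a normal string literal is a backspace (\x08).
mystring = 'To: John Lennon <John.Lennon@gmail.com> \b002; Paul McCartney <Paul.McCartney@yahoo.com> \b002;'

# Hypothetical extension of the question's alternation that also removes the ";".
remove_all = re.compile(r'[Tt]o:|\x08002;|<.*?>')

# Remove every match, then collapse the leftover runs of whitespace.
flattened = " ".join(remove_all.sub("", mystring).split())
print(flattened)  # John Lennon Paul McCartney
```

Once everything is stripped, nothing in the result marks where one name ends and the next one starts.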

You could make the match more specific by matching the possible leading parts, and capture the part that you want instead of replacing.

(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>
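  • (?:\\b[Tt]o:|\b002;) Match either To: or to: preceded by a word boundary (written \\b in a normal Python string), or a backspace char followed by 002;
  • \s* Match optional whitespace chars
  • (.+?) Capture 1 or more chars in group 1, as few as possible
  • \s* Match optional whitespace chars
  • <[^<>@]+@[^<>@]+> Match an address containing a single @ between angle brackets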
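The breakdown above can also be written as a raw, verbose pattern so that each part carries its own comment. This is only a sketch: it assumes the separator really is a backspace character followed by `002;` (spelled `\x08` here to make that explicit), and the name `NAME_BEFORE_ADDRESS` is made up for the example.

```python
import re

# Assumption: the separator is a literal backspace (\x08) followed by "002;".
NAME_BEFORE_ADDRESS = re.compile(
    r"""
    (?:\b[Tt]o:|\x08002;)   # leading To: / to: at a word boundary, or backspace + 002;
    \s*                     # optional whitespace
    (.+?)                   # group 1: the display name, as short as possible
    \s*                     # optional whitespace
    <[^<>@]+@[^<>@]+>       # an address with exactly one @ between angle brackets
    """,
    re.VERBOSE,
)

mystring = "To: John Lennon <John.Lennon@gmail.com> \x08002; Paul McCartney <Paul.McCartney@yahoo.com> \x08002;"

# findall returns only the contents of capture group 1 for each match.
names = NAME_BEFORE_ADDRESS.findall(mystring)
print(names)  # ['John Lennon', 'Paul McCartney']
```

With a raw string the regex engine itself interprets `\b` and `\x08`, so no double backslash is needed.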

See a regex demo and a Python demo.

For example

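import re

# Deliberately not a raw string: "\\b" reaches the regex engine as \b (word boundary),
# while "\b" becomes a literal backspace character (\x08), like the one in mystring.
pattern = "(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>"
mystring = 'To: John Lennon <John.Lennon@gmail.com> \b002; Paul McCartney <Paul.McCartney@yahoo.com> \b002;'
# findall returns only the contents of capture group 1 for each match
print(re.findall(pattern, mystring))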

Output

['John Lennon', 'Paul McCartney']
The fourth bird
  • `\b` in `\b002;` is a backspace char. OP has got a simple string literal, not a raw one. – Wiktor Stribiżew Nov 16 '21 at 17:32
  • @WiktorStribiżew So `\b`, as described [here](https://docs.python.org/3/howto/regex.html#more-metacharacters), without the `r'` prefix would become `\x08` in both the regex and `mystring`, as in `"(?:\\b[Tt]o:|\b002;)\s*(.+?)\s*<[^<>@]+@[^<>@]+>"`, right? – The fourth bird Nov 16 '21 at 18:05
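The difference discussed in these comments can be checked directly. The snippet below is a small sketch of how Python treats `\b` in normal and raw string literals; the variable names are made up for the example.

```python
import re

# In a normal string literal, "\b" is a single character: backspace (\x08).
assert "\b" == "\x08" and len("\b") == 1

# In a raw literal (or written as "\\b") it is a backslash followed by "b",
# which the regex engine reads as a word boundary.
assert r"\b" == "\\b" and len(r"\b") == 2

word_boundary = re.search(r"\bTo:", "To: John")  # \b is a word boundary here
backspace = re.search("\bTo:", "To: John")       # looks for \x08 right before "To:"
print(word_boundary is not None, backspace is None)  # True True
```

So the same two characters in the source code mean a backspace or a word boundary depending on whether the literal is raw.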