I have a small parser that should be able to parse the message below. The last group in the message is the email, which is optional.

Unfortunately, my current regex fails to capture the email: the email group comes back as None.

What do I need to change so that the email is captured but remains optional?

import re

# message = "/sell 2000 USDT @ 5.56 111.222.333-44 +123456789"
message = "/sell 2000 USDT @ 5.56 111.222.333-44 +123456789 mail@example.com"

parser = re.compile(
    r"""
    ^/
    (?P<operation>buy|sell)
    \s
    (?P<amount>.+)
    \s
    (?P<network>.+)
    \s
    @
    \s
    (?P<rate>.+)
    \s
    (?P<legal>.+)
    \s
    (?P<cellphone>.+)
    \s?
    (?P<email>.+)?
    $
    """,
    re.VERBOSE,
)

result = parser.match(message)

group = result.groupdict()

print(group)
Wiktor Stribiżew
Rodrigo
  • If you put `.+` everywhere, all you will produce (if it works at all) is a pathological pattern that can time out. So be more precise in describing your string: use character classes and bounded quantifiers (when possible), and learn the difference between a greedy and a non-greedy quantifier. This is the basis, and this is the way to go. Good luck. – Casimir et Hippolyte Nov 23 '21 at 20:46
  • Look first at the results that you get. You should notice that one of your groups is capturing a lot more than it should. The reason for this is that regex matching is greedy by default. Please see the linked duplicate to understand. In the future, please try to do more thorough debugging, and make sure you are completely aware of what the code is doing wrong, not just the part that makes you want to ask the question. – Karl Knechtel Nov 23 '21 at 21:10
  • Also, consider using other techniques to solve the problem. If you have a line that consists of several whitespace-delimited things, you should first think of using `.split` to get the pieces, and then perhaps regexes can help you validate each piece. – Karl Knechtel Nov 23 '21 at 21:11
  • `.split` will probably help me more than the regex itself. Thanks, Karl. – Rodrigo Nov 23 '21 at 21:44
  • @Rodrigo: if you are sure that no field contains a space and that the only optional field (the email here) is at the end, it's clearly the best option. – Casimir et Hippolyte Nov 23 '21 at 21:46
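
The `.split` suggestion from the comments could look roughly like the sketch below. It assumes that no field contains a space and that the email is the only optional field and comes last, as noted above. The function name `parse_message` is hypothetical and not part of the original question.

```python
def parse_message(message):
    """Split a whitespace-delimited command into named fields.

    Hypothetical helper: assumes no field contains a space and that
    the optional email, if present, is the last token.
    """
    parts = message.split()
    # Seven mandatory tokens (including the "@" separator) plus an optional email.
    if len(parts) not in (7, 8) or parts[3] != "@":
        return None
    command, amount, network, _, rate, legal, cellphone = parts[:7]
    if command not in ("/buy", "/sell"):
        return None
    return {
        "operation": command[1:],  # drop the leading slash
        "amount": amount,
        "network": network,
        "rate": rate,
        "legal": legal,
        "cellphone": cellphone,
        "email": parts[7] if len(parts) == 8 else None,
    }


print(parse_message("/sell 2000 USDT @ 5.56 111.222.333-44 +123456789 mail@example.com"))
print(parse_message("/sell 2000 USDT @ 5.56 111.222.333-44 +123456789"))
```

Each piece can then be validated separately, for example with a small regex per field.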

1 Answer

The best approach is to describe each token as precisely as possible.
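
To see why precision matters, here is a small sketch that runs the pattern from the question on the sample message and prints what each group captures. Because `.+` is greedy and also matches spaces, the earlier groups absorb extra fields and the optional email group is left with nothing.

```python
import re

message = "/sell 2000 USDT @ 5.56 111.222.333-44 +123456789 mail@example.com"

# The pattern from the question, unchanged: every field is a greedy `.+`.
greedy_parser = re.compile(
    r"""
    ^/
    (?P<operation>buy|sell)
    \s
    (?P<amount>.+)
    \s
    (?P<network>.+)
    \s
    @
    \s
    (?P<rate>.+)
    \s
    (?P<legal>.+)
    \s
    (?P<cellphone>.+)
    \s?
    (?P<email>.+)?
    $
    """,
    re.VERBOSE,
)

greedy_groups = greedy_parser.match(message).groupdict()
for name, value in greedy_groups.items():
    print(f"{name}: {value!r}")
# rate swallows two fields, so legal and cellphone shift right
# and the email ends up in the cellphone group.
```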

You can use

^/                              # start of string, then a literal /
(?P<operation>buy|sell)         # buy or sell
\s+                             # 1+ whitespaces
(?P<amount>\d+)                 # 1+ digits
\s+                             # 1+ whitespaces
(?P<network>\S+)                # 1+ non-whitespaces
\s+ @ \s+                       # 1+ whitespaces, @, 1+ whitespaces
(?P<rate>\S+)                   # 1+ non-whitespaces
\s+                             # 1+ whitespaces
(?P<legal>\S+)                  # 1+ non-whitespaces
\s+                             # 1+ whitespaces
(?P<cellphone>\+?\d(?:\s?\d)+)  # optional +, then 2+ digits, each optionally preceded by one whitespace
(?:                             # non-capturing group start:
    \s+ (?P<email>[^\s@]+@\S+)  # 1+ whitespaces, email
)?                              # non-capturing group end, optional due to ?
$                               # end of string

See the regex demo
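
As a minimal sketch, the pattern above can be dropped into the script from the question as shown here. It relies on `re.VERBOSE`, which makes the regex engine ignore the whitespace and `#` comments inside the pattern. Both sample messages from the question are checked.

```python
import re

parser = re.compile(
    r"""
    ^/
    (?P<operation>buy|sell)
    \s+
    (?P<amount>\d+)
    \s+
    (?P<network>\S+)
    \s+ @ \s+
    (?P<rate>\S+)
    \s+
    (?P<legal>\S+)
    \s+
    (?P<cellphone>\+?\d(?:\s?\d)+)
    (?:
        \s+ (?P<email>[^\s@]+@\S+)  # the whole group is optional
    )?
    $
    """,
    re.VERBOSE,
)

messages = [
    "/sell 2000 USDT @ 5.56 111.222.333-44 +123456789",
    "/sell 2000 USDT @ 5.56 111.222.333-44 +123456789 mail@example.com",
]

for message in messages:
    result = parser.match(message)
    # match() returns None when the message does not fit the pattern.
    print(result.groupdict() if result else None)
```

When the email is missing, the `email` key is still present in the dictionary and its value is `None`.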

Wiktor Stribiżew