I'm trying to write a regex for the following cases. I use re.finditer() together with re.IGNORECASE to match against the strings. The possible inputs and the corresponding expected matches are:

  1. 'vessel eta: 12-10-19' should match with 'vessel eta: '
  2. 'vessel eta 12-10-19' should match with 'vessel eta '
  3. 'etd eta : 12/10/19' should match with 'etd eta '
  4. 'eta SIN: 12/10/19' should match with 'eta SIN:'
  5. 'eta : 12-10-19' should match with 'eta :'
  6. 'eta: 12-10-19' should match with 'eta: '
  7. 'eta. 12-10-19' should match with 'eta. '
  8. 'eta 12-10-19' should match with 'eta '

So far, I have written this:

((vessel)|(ETD))?(\s\.\:)?(ETA)[\s\.\:]{1,3}?(SIN)?[\s\.\:]?

But according to regex101, this matches all cases except the first three, where the first word (whether it's 'vessel' or 'etd') is not captured.
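A minimal reproduction of this behaviour, assuming Python's standard `re` module; the helper name `first_match` is only illustrative and not part of my real script:

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"((vessel)|(ETD))?(\s\.\:)?(ETA)[\s\.\:]{1,3}?(SIN)?[\s\.\:]?",
    re.IGNORECASE,
)

SAMPLES = [
    "vessel eta: 12-10-19",
    "vessel eta 12-10-19",
    "etd eta : 12/10/19",
    "eta SIN: 12/10/19",
]


def first_match(text):
    """Return the text of the first match that finditer yields, or None."""
    for m in PATTERN.finditer(text):
        return m.group(0)
    return None


for s in SAMPLES:
    print(repr(s), "->", repr(first_match(s)))
```

For the first three samples the match only starts at 'eta', so the leading 'vessel' or 'etd' is dropped; the fourth sample matches as expected.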

What's wrong with my regex?

Konrad Rudolph
Arkistarvh Kltzuonstev
  • Try `(?:vessel|ETD)?[\s.:]?ETA[\s.:]{1,3}?(?:SIN)?[\s.:]?`. Your `(\s\.\:)?` looks a bit off: didn't you want to match one of the characters rather than a sequence of them? – Wiktor Stribiżew Nov 12 '19 at 09:43
  • @WiktorStribiżew Thanks. Even `(vessel|ETD)?[\s\.\:]?(ETA)[\s\.\:]{1,3}?(SIN)?[\s\.\:]?` solved my issue. What's the difference between adding `?:` in front of `vessel` and not adding it? – Arkistarvh Kltzuonstev Nov 12 '19 at 09:48
  • If you do not need the extra parts from the match, there is no point in using capturing groups; non-capturing ones are best. – Wiktor Stribiżew Nov 12 '19 at 09:52

2 Answers


The (\s\.\:)? pattern matches an optional sequence of a whitespace character, a dot and then a colon, whereas you want to match a single optional character: a whitespace, . or :.
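A small sketch of that difference, assuming Python's `re` module; the helper `consumed` is a made-up name that just reports what a pattern consumes at the start of a string:

```python
import re


def consumed(pattern, text):
    """Return what `pattern` consumes at the very start of `text`."""
    # Both patterns below are optional, so re.match always succeeds,
    # possibly with an empty match.
    return re.match(pattern, text).group(0)


# Group: an optional *sequence* of whitespace, then '.', then ':'.
print(repr(consumed(r"(\s\.\:)?", " .:x")))   # consumes all three characters
print(repr(consumed(r"(\s\.\:)?", " eta")))   # sequence absent, consumes nothing

# Character class: an optional *single* whitespace, '.' or ':'.
print(repr(consumed(r"[\s.:]?", " eta")))     # consumes the one space
print(repr(consumed(r"[\s.:]?", ": 12")))     # consumes the one colon
```

Because the group in the question can only match the full three-character sequence, it matches the empty string between 'vessel' and 'eta', and the following ETA then fails on the space.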

Note that you over-escape characters in the character class: inside a class, . always matches a literal dot, and : is not a special regex metacharacter.
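A quick check of the escaping point, assuming Python's `re` module; the sample string is made up for illustration:

```python
import re

text = "eta. : 12-10-19"

# The escaped and unescaped forms of the class find exactly the same characters.
escaped = re.findall(r"[\s\.\:]", text)
plain = re.findall(r"[\s.:]", text)
print(escaped == plain, plain)

# Outside a class an unescaped dot is a wildcard; inside a class it is a literal dot.
print(bool(re.fullmatch(r".", "x")))    # wildcard dot accepts 'x'
print(bool(re.fullmatch(r"[.]", "x")))  # literal dot rejects 'x'
```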

It is advisable to use non-capturing groups ((?:...)) if you do not need to further access parts of the regex matches, or just remove the grouping parentheses altogether when they do not contain alternatives or are not quantified.
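To illustrate, here is a sketch in Python using shortened versions of the pattern (they stop after ETA purely to keep the example small): both match the same text, but only the capturing version exposes sub-matches, and `re.findall` changes its return shape accordingly.

```python
import re

capturing = re.compile(r"(vessel|ETD)?[\s.:]?(ETA)", re.IGNORECASE)
non_capturing = re.compile(r"(?:vessel|ETD)?[\s.:]?ETA", re.IGNORECASE)

text = "vessel eta"

m_cap = capturing.match(text)
m_non = non_capturing.match(text)

# The overall match is identical.
print(m_cap.group(0), "|", m_non.group(0))

# Only the capturing pattern stores the parts separately.
print(m_cap.groups())   # one entry per capturing group
print(m_non.groups())   # empty tuple: nothing was captured

# findall returns tuples of groups when groups exist, whole matches otherwise.
print(capturing.findall(text))
print(non_capturing.findall(text))
```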

You may use

(?:vessel|ETD)?[\s.:]?ETA[\s.:]{1,3}?(?:SIN)?[\s.:]?

See the regex demo.
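As a sanity check, the pattern can be run over the eight sample strings from the question with re.finditer() and re.IGNORECASE. This is only a sketch; the names `SAMPLES` and `prefixes` are illustrative.

```python
import re

# The pattern suggested above, compiled case-insensitively as in the question.
PATTERN = re.compile(
    r"(?:vessel|ETD)?[\s.:]?ETA[\s.:]{1,3}?(?:SIN)?[\s.:]?",
    re.IGNORECASE,
)

SAMPLES = [
    "vessel eta: 12-10-19",
    "vessel eta 12-10-19",
    "etd eta : 12/10/19",
    "eta SIN: 12/10/19",
    "eta : 12-10-19",
    "eta: 12-10-19",
    "eta. 12-10-19",
    "eta 12-10-19",
]


def prefixes(text):
    """Return the text of every match that finditer finds in `text`."""
    return [m.group(0) for m in PATTERN.finditer(text)]


for s in SAMPLES:
    print(f"{s!r:26} -> {prefixes(s)}")
```

Each sample yields exactly one match that now includes the leading 'vessel' or 'etd'. Note that the third sample gives 'etd eta :' (including the colon) rather than the 'etd eta ' listed in the question, which is consistent with how case 5 ('eta :') is handled.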

Wiktor Stribiżew

Your regex should start by matching the optional string "vessel" or "etd" at the beginning of the input:

(vessel|etd)?

It should then match the word "ETA" followed by a colon:

(vessel|etd)?(eta:)

I suppose the rest contains a simple date format, which can be captured by the following:

(vessel|etd)?(eta:)(\d\d-\d\d-\d\d)

The regex above is wrong, though: it does not match any whitespace, only compact strings like "etdeta:12-31-13". We need to insert some instances of "\s+", which translates to "at least one whitespace character":

(vessel\s+|etd\s+)?(eta\s+:\s+)(\d\d-\d\d-\d\d)
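A short sketch of what this final pattern accepts, assuming Python's `re` module with re.IGNORECASE as in the question; the helper `parts` is a hypothetical name. Since "\s+" requires at least one whitespace character on both sides of the colon and the date part only allows dashes, inputs such as 'eta: 12-10-19' or dates written with slashes are not matched:

```python
import re

PATTERN = re.compile(
    r"(vessel\s+|etd\s+)?(eta\s+:\s+)(\d\d-\d\d-\d\d)",
    re.IGNORECASE,
)


def parts(text):
    """Return the three captured groups, or None if the pattern does not match."""
    m = PATTERN.search(text)
    return m.groups() if m else None


print(parts("vessel eta : 12-10-19"))  # prefix, 'eta : ' and the date
print(parts("eta : 12-10-19"))         # prefix group is None
print(parts("vessel eta: 12-10-19"))   # no whitespace before ':' -> None
print(parts("etd eta : 12/10/19"))     # slashes in the date -> None
```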
soulmerge
  • My main goal is to capture the corresponding words before the possible date. I have another script that captures any form of date, so the date is not included in this regex. Also, there should be an optional check for the `'SIN'` keyword after `'ETA'` is found. – Arkistarvh Kltzuonstev Nov 12 '19 at 10:02