RegEx prioritize the longest pattern

Question

I have some strings that I want to search for a match using regular expressions.

foo
AB0001
AB0002 foo
foo AB0003
foo AB0004A AB0004.1
AB0005.1 foo AB0005A bar AB0005

I want at most one ID per line. IDs that end with a letter should take priority, and IDs followed by .1 should be ignored.

foo                              -> no match
AB0001                           -> AB0001
AB0002 foo                       -> AB0002
foo AB0003.1                     -> no match
foo AB0004A AB0004.1             -> AB0004A
AB0005.1 foo AB0005A bar AB0005  -> AB0005A

I thought I could simply use the priority given by the alternation symbol | to prefer the ID with a capital letter at the end, but I always get multiple matches.

My suggestion: regex101.com/r/yP5kX4/1

Off-topic: When should I write the whole RegEx anchored with ^ and $ and work with capturing/non-capturing groups, and when should I keep the RegEx as short as possible?

You cannot achieve that with a pure PCRE/TRE regex in R. – Wiktor Stribiżew Mar 04 '16 at 08:58

score 1 · Accepted Answer · 2016-03-04T01:29:04.090

This is one way. It's somewhat complex because the leading .*? has to be
lazy in order to find the first instance of an ID.

This regex is to be used in Multi-Line mode. Add a (?m) to the beginning
of the regex if you can.

The resulting ID is in capture group 1.

^.*?\b([A-Z]+\d+[A-Z]|[A-Z]+\d+(?!\.\d)(?!.*?\b[A-Z]+\d+[A-Z]))\b

Explained

 ^                                  # Beginning of line (multi-line mode)
 .*?                                # Any char, lazy to get first instance
 \b    
 (                                  # (1 start), the ID
      [A-Z]+ \d+ [A-Z]                   # Priority, with trailing letter
   |                                   # or,
      [A-Z]+ \d+                         # no trailing letter
      (?! \. \d )                        # no dot digit after digit
      (?! .*? \b [A-Z]+ \d+ [A-Z] )      # and only if no trailing-letter ID follows downstream
 )                                  # (1 end)
 \b
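
As a quick check, the pattern can be run against the lines from the question. This is only a sketch in Python's re module rather than R (an assumption made for illustration); the helper name find_id and the samples list are hypothetical and not part of the answer.

```python
import re

# Pattern copied from the answer above; re.MULTILINE stands in for (?m).
ID_PATTERN = re.compile(
    r"^.*?\b([A-Z]+\d+[A-Z]|[A-Z]+\d+(?!\.\d)(?!.*?\b[A-Z]+\d+[A-Z]))\b",
    re.MULTILINE,
)


def find_id(line):
    """Return the ID from capture group 1, or None if the line has no valid ID."""
    match = ID_PATTERN.search(line)
    return match.group(1) if match else None


# Sample lines taken from the question's expected-output table.
samples = [
    "foo",
    "AB0001",
    "AB0002 foo",
    "foo AB0003.1",
    "foo AB0004A AB0004.1",
    "AB0005.1 foo AB0005A bar AB0005",
]

for line in samples:
    print(f"{line!r} -> {find_id(line)}")
```

Each line is searched on its own here, so the multi-line flag only matters if several lines are passed in as one string.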

edited Mar 04 '16 at 01:29

answered Mar 04 '16 at 01:14

Thanks, this works as intended! Does the negative lookahead just stop the "match" from being recognized? I understand the first expression after `|` and the second, but I don't get the third. Why is it needed? Maybe just for strings like `AB0001A AB0001.1 AB0002A`? So the first ID is marked as a match, then the negative lookahead finds the third ID and discards the previous match, taking `AB0002A` as the final match? – heiiRa Mar 04 '16 at 22:56
@heiiRa - Almost right. This is the base expression `[A-Z]+ \d+ [A-Z]?` It's just split up as an _OR_ (alternation). The engine tests the alternation at each character position. If it can't find `[A-Z]+ \d+ [A-Z]` it will match `[A-Z]+ \d+` if it can. The first _assertion_ `(?! \. \d )` stops it from matching _any_ `AB0001.1`, the second assertion `(?! .*? \b [A-Z]+ \d+ [A-Z] )` stops it from matching the current candidate `AB0001` _if there is an `AB0001B` anywhere to the right_. It then checks each character until it gets to `AB0001B` which it matches. – Mar 04 '16 at 23:56

score 0 · Answer 2 · answered Mar 04 '16 at 01:18

I'd detect the strings in R 3.1.3 this way:

grepl("(?<!\\.)[A-Z0-9]+?(?=\\s)", subject, perl=TRUE);

Based on the input you posted in your question, the output will be:

INPUT

foo
AB0001
AB0002 foo
foo AB0003
foo AB0004A AB0004.1
AB0005.1 foo AB0005A bar AB0005

-

OUTPUT

AB0001
AB0002
AB0003
AB0004A
AB0005A

score 0 · Answer 3 · edited May 23 '17 at 12:31

The following regex should do:

(AB(?:[0-9A-Z]{5}|[0-9]{4}))(?:\s+)

I added a non-capturing group (?:\s+) to match the space(s) after the ID.

My thoughts: (Please correct me if I am wrong)

When to use the whole RegEx starting with ^ ending with $? When the regex has to match the whole string, from the start (^) to the end ($).

And work with capture/non-capturing groups? Use capturing groups if you want to extract or reference that information; use non-capturing groups if you just want to match without extracting or referencing anything. Please take a look at: What is a non-capturing group? What does a question mark followed by a colon (?:) mean?

When should I write RegEx as short as possible? The shorter the better, as long as it works.
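
To illustrate the difference between capturing and non-capturing groups, here is a small sketch in Python's re module (chosen for illustration; the patterns and variable names are made up for this example and are not the regex proposed above).

```python
import re

line = "foo AB0004A AB0004.1"

# Both patterns match the same text; only the first group differs.
capturing = re.search(r"(foo|bar) (AB\d{4}[A-Z]?)", line)
non_capturing = re.search(r"(?:foo|bar) (AB\d{4}[A-Z]?)", line)

# The capturing version stores the prefix as group 1 and the ID as group 2.
print(capturing.groups())      # ('foo', 'AB0004A')

# The non-capturing version still requires the prefix but stores only the ID.
print(non_capturing.groups())  # ('AB0004A',)
```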

score 0 · Answer 4 · answered Mar 04 '16 at 05:27

\b(AB\d{4}(?!\.\d)[A-Z]?)\b

That's AB followed by four digits, which must not be followed by a decimal-digit sequence, but may end with a letter. The word boundaries (\b) help ensure that the matched sequence is not part of a longer sequence that just happens to look like an ID.
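
A minimal sketch of how this pattern behaves on the question's lines, written with Python's re module as an assumption (the question itself targets R); the helper name all_ids is hypothetical.

```python
import re

# Pattern from this answer: AB, four digits, not followed by ".digit",
# then an optional trailing letter, all between word boundaries.
pattern = re.compile(r"\b(AB\d{4}(?!\.\d)[A-Z]?)\b")


def all_ids(line):
    """Return every ID the pattern finds in the line, left to right."""
    return pattern.findall(line)


print(all_ids("foo AB0004A AB0004.1"))             # ['AB0004A']
print(all_ids("AB0005.1 foo AB0005A bar AB0005"))  # ['AB0005A', 'AB0005']
print(all_ids("foo AB0003.1"))                     # []
```

The IDs with a .1 suffix are skipped, but every remaining ID on a line is still returned, so taking only the first match does not by itself prefer the trailing-letter form.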

An alternation-based solution is never going to work. It's true that if two or more branches of an alternation can match at a given point, the first one is always selected (in most regex flavors, anyway). But that doesn't help you, because the regex engine always favors the first (leftmost) match; that's its highest priority. So the first match wins no matter which branch of the alternation it uses.
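
The point about leftmost matches can be seen in a small sketch, again in Python's re module as an assumption; the pattern and the sample line are illustrative and not taken from the question's regex101 link.

```python
import re

# The trailing-letter branch is listed first, but branch order only
# decides between alternatives that start at the same position.
alternation = re.compile(r"\b(?:AB\d{4}[A-Z]|AB\d{4})\b")

line = "AB0005 foo AB0005A"

first = alternation.search(line)
print(first.group(0))             # AB0005  (leftmost match wins)
print(alternation.findall(line))  # ['AB0005', 'AB0005A']
```

The engine stops at the first position where any branch succeeds, so the plain ID at the start of the line is reported before the preferred one is ever examined.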

As for the anchors (^ and $), they're usually needed only when you want to match the whole string, or a whole line in multiline mode (and BTW, since you're not using them, you don't need the /m flag; all it does is change the meaning of the anchors).
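
A short sketch of what the multiline flag changes, using Python's re module as an assumption; the sample text and variable names are made up for illustration.

```python
import re

text = "AB0001\nfoo AB0002\nAB0003"

# Without the flag, ^ and $ only anchor to the whole string.
whole_string = re.findall(r"^AB\d{4}$", text)

# With the flag, ^ and $ anchor to each line.
per_line = re.findall(r"^AB\d{4}$", text, flags=re.MULTILINE)

# Without anchors in the pattern, the flag makes no difference.
unanchored_flag = re.findall(r"AB\d{4}", text, flags=re.MULTILINE)
unanchored_plain = re.findall(r"AB\d{4}", text)

print(whole_string)      # []
print(per_line)          # ['AB0001', 'AB0003']
print(unanchored_flag)   # ['AB0001', 'AB0002', 'AB0003']
print(unanchored_plain)  # ['AB0001', 'AB0002', 'AB0003']
```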

The issue of capturing groups is interesting here because you don't need them. The only reason I used one is because the Regex101 site doesn't show the matches in the side panel unless they're in capturing groups. It's an annoying glitch in an otherwise very useful site. But generally speaking, you use capturing groups when you need to extract specific portions of the match, or when you need to use backreferences in the regex itself.
