Regex using positive look ahead not directly after

Question

I have product codes, for example HX3923, which always start with 2 capital letters and end with 4 digits. Some products have a "gold" color, which appears somewhere in the text.

Example:

HX3923, width: 0.3, height: 0.7, gold, HX3924, color="blue", width=0.3

I need to match HX3923, but not HX3924, since the latter has no gold color.

This pattern matches both product codes:

[A-Z][A-Z]\d\d\d\d

I thought I needed to add something like

(?=gold)

But that only looks directly after the product code. How can I make it check that gold appears BEFORE the next product code "starts"?

I currently have this ugly solution:

[A-Z][A-Z]\d\d\d\d(?=.{0,100}gold)

The fourth bird · Accepted Answer · 2019-11-01T16:21:32.247

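Your current approach (?=.{0,100}gold) uses a positive lookahead to assert that gold appears within the next 0 to 100 characters. That window is arbitrary, and it can reach past the start of the next product code.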
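As a rough illustration of that limitation, here is a minimal Python sketch of the bounded lookahead. The names BOUNDED, question_text and variant_text are made up for this example, and variant_text is a hypothetical input that is not from the question; it only shows what could happen when gold belongs to a later product.

```python
import re

# The bounded lookahead from the question, unchanged.
BOUNDED = re.compile(r"[A-Z][A-Z]\d\d\d\d(?=.{0,100}gold)")

# Sample from the question: only HX3923 is followed by "gold".
question_text = 'HX3923, width: 0.3, height: 0.7, gold, HX3924, color="blue", width=0.3'

# Hypothetical variant (an assumption for illustration):
# here "gold" comes after HX3924, not after HX3923.
variant_text = "HX3923, width: 0.3, HX3924, gold"

question_matches = BOUNDED.findall(question_text)
variant_matches = BOUNDED.findall(variant_text)

print(question_matches)  # ['HX3923']
print(variant_matches)   # ['HX3923', 'HX3924']
```

In the variant, HX3923 is reported even though the only gold sits after HX3924, because the lookahead does not know where the next product code starts.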

Instead, one option is to use a capturing group (), word boundaries \b and a tempered greedy token approach to match gold before encountering another [A-Z][A-Z]\d{4} pattern.

\b([A-Z][A-Z]\d{4})\b(?:(?![A-Z][A-Z]\d{4}).)*\bgold\b

In parts

\b([A-Z][A-Z]\d{4})\b Match 2 uppercase chars and 4 digits in capturing group 1
(?: Non capturing group
- (?! Negative lookahead, assert what is on the right is not
  - [A-Z][A-Z]\d{4} Match 2 uppercase chars and 4 digits
- ). Close lookahead and match any char except a newline
)* Close non capturing group and repeat 0+ times
\bgold\b Match gold between word boundaries


The product codes are in capturing group 1.
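A minimal sketch of how the pattern could be used from Python, assuming the re module and single-line input. The helper name gold_codes and the variable names are hypothetical; the pattern itself is the one given above.

```python
import re

# Tempered greedy token: after a product code, consume characters one at a
# time, but stop as soon as another product code would start, then require
# the word "gold".
GOLD_CODE = re.compile(
    r"\b([A-Z][A-Z]\d{4})\b(?:(?![A-Z][A-Z]\d{4}).)*\bgold\b"
)

def gold_codes(text):
    """Return the product codes that have 'gold' before the next code starts."""
    return [m.group(1) for m in GOLD_CODE.finditer(text)]

# Sample from the question.
question_text = 'HX3923, width: 0.3, height: 0.7, gold, HX3924, color="blue", width=0.3'
print(gold_codes(question_text))  # ['HX3923']

# Hypothetical input where "gold" belongs to the second product only.
variant_text = "HX3923, width: 0.3, HX3924, gold"
print(gold_codes(variant_text))   # ['HX3924']
```

Note that . does not match a newline by default in Python, so a product whose attributes span several lines would need the re.DOTALL flag; whether that applies depends on the real data.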
