I have product codes, for example HX3923, which always start with 2 capital letters and end with 4 digits. Some products have the color "gold", which appears somewhere in the text after the code.

Example:

HX3923, width: 0.3, height: 0.7, gold, HX3924, color="blue", width=0.3

I need to match HX3923, but not HX3924, since the latter has no gold color.

This selects both product codes:

[A-Z][A-Z]\d\d\d\d

I thought I needed to add something like

(?=gold)

But that only looks directly after the product code. How can I check that gold appears BEFORE the next product code "starts"?

I currently have this ugly solution:

[A-Z][A-Z]\d\d\d\d(?=.{0,100}gold)
stevebanks

1 Answer

Your current approach, (?=.{0,100}gold), uses a positive lookahead to assert that gold occurs within the next 0 to 100 characters, regardless of whether another product code comes in between.
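
As a rough illustration of where the bounded lookahead can go wrong, here is a small Python check. The sample string is invented for this sketch (it is not from the question): gold appears only after the second code, yet both codes come back, because the lookahead is free to scan past the next product code.

```python
import re

# The question's pattern with the bounded lookahead.
BOUNDED = re.compile(r"[A-Z][A-Z]\d\d\d\d(?=.{0,100}gold)")

# Hypothetical input: only HX3924 is followed by "gold".
sample = "HX3923, width: 0.3, HX3924, gold"

bounded_hits = BOUNDED.findall(sample)
print(bounded_hits)  # ['HX3923', 'HX3924']
```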

Instead, one option is to use a capturing group (), word boundaries \b and a tempered greedy token approach to match gold before encountering another [A-Z][A-Z]\d{4} pattern.

\b([A-Z][A-Z]\d{4})\b(?:(?![A-Z][A-Z]\d{4}).)*\bgold\b

In parts

  • \b([A-Z][A-Z]\d{4})\b Match 2 uppercase chars and 4 digits in capturing group 1
  • (?: Non capturing group
    • (?! Negative lookahead, assert what is on the right is not
      • [A-Z][A-Z]\d{4} Match 2 uppercase chars and 4 digits
    • ). Close lookahead and match any char except a newline
  • )* Close non capturing group and repeat 0+ times
  • \bgold\b Match gold between word boundaries

The values are in group 1.
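
A minimal sketch of using this pattern from Python, assuming the standard re module. The helper name gold_codes is made up for illustration; with a single capturing group, re.findall returns the group 1 values directly.

```python
import re

# Product code, then any characters that do not start another product code,
# then the word "gold".
PATTERN = re.compile(r"\b([A-Z][A-Z]\d{4})\b(?:(?![A-Z][A-Z]\d{4}).)*\bgold\b")


def gold_codes(text):
    """Return the product codes that reach 'gold' before the next code starts.

    Hypothetical helper, not part of the original answer.
    """
    return PATTERN.findall(text)


# The example line from the question.
example = 'HX3923, width: 0.3, height: 0.7, gold, HX3924, color="blue", width=0.3'
print(gold_codes(example))  # ['HX3923']
```

Because . does not match a newline by default, this only finds gold on the same line as the product code; passing re.DOTALL to re.compile would let the search span lines.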

The fourth bird