Is it possible to exclude parts of regular expression matches? Take this scenario as an example:

FREE SOFT FOUNDATION V2 1989 PAGE 2
STALLMANWORKS 2000 1977;PAGE 2
THE GNU PAGE 3 1977

I'm trying to match just FREE SOFT FOUNDATION, STALLMANWORKS 2000 and THE GNU. That's easy, but now I have to exclude any combination of [0-9;]+\s?(PAGE) that comes after the title. I tried a negative lookahead, but had no luck:

(?!([0-9]+\s?(PAGE)))([A-Z0-9\s]+)
vinnylinux

3 Answers

I'm not sure exactly what is desired here, but my guess is that this expression

([\s\S].*?)\b((?:\s*\d+\s+;?|\s*\d+;)PAGE\s+\d+|\s*PAGE.*[0-9])

may be worth looking into. Here, the second group describes what we would like to exclude, and in front of it we simply add:

 ([\s\S].*?)

to collect the desired characters.

Emma

Note that it is matching "STALLMANWORKS" instead of "STALLMANWORKS 2000". This is one of my challenges. :( – vinnylinux Jul 01 '19 at 19:19

You need to pair the negative lookahead with every character you match. Your example regex performs the negative lookahead check only once, at the position where the match starts.

Something like:

((?:(?!\s+V?[0-9]|\s+PAGE)[A-Z0-9\s])+)
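
As a rough illustration in Python (the sample lines come from the question; the variable names `once` and `each` are made up for this sketch), the difference between checking the lookahead once and checking it before every character might look like this:

```python
import re

# Sample lines taken from the question.
lines = [
    "FREE SOFT FOUNDATION V2 1989 PAGE 2",
    "STALLMANWORKS 2000 1977;PAGE 2",
    "THE GNU PAGE 3 1977",
]

# Pattern from the question: the lookahead is tested once, where the match starts.
once = re.compile(r"(?!([0-9]+\s?(PAGE)))([A-Z0-9\s]+)")

# Pattern from this answer: the lookahead is tested before every character.
each = re.compile(r"((?:(?!\s+V?[0-9]|\s+PAGE)[A-Z0-9\s])+)")

once_titles = [once.match(line).group(3) for line in lines]
each_titles = [each.match(line).group(1) for line in lines]

print(once_titles)
# ['FREE SOFT FOUNDATION V2 1989 PAGE 2', 'STALLMANWORKS 2000 1977', 'THE GNU PAGE 3 1977']
print(each_titles)
# ['FREE SOFT FOUNDATION', 'STALLMANWORKS', 'THE GNU']
```

Note that, as written, this pattern stops at any whitespace followed by a digit, so on the second line it yields STALLMANWORKS rather than STALLMANWORKS 2000.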
Jonas Berlin

If you only want to get those matches, you could use an anchor ^ to assert the start of the string.

In your example data, it seems you don't want the digits that come right before PAGE.

You could use a tempered greedy token approach: assert that what is on the right is not PAGE, and then match any character from the character class [A-Z0-9\s].

Then make sure that the match ends with an uppercase A-Z, optionally followed by a space and 4 digits, and then a word boundary \b.

^(?:(?! PAGE)[A-Z0-9\s])+[A-Z](?: \d{4})?\b

Explanation

  • ^ Start of string
  • (?: Non capturing group
    • (?! PAGE) Negative lookahead, assert what is directly on the right is not a space followed by PAGE
    • [A-Z0-9\s] Match any of the listed in the character class
  • )+ Close non capturing group and repeat 1+ times
  • [A-Z] Match a single uppercase character A-Z
  • (?: \d{4})? Optionally match a space and 4 digits
  • \b Word boundary
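
A minimal sketch of trying the pattern in Python, assuming each title sits on its own line and is matched separately (the sample lines come from the question; the names `title` and `titles` are only for illustration):

```python
import re

# Pattern from this answer, applied to one line at a time.
title = re.compile(r"^(?:(?! PAGE)[A-Z0-9\s])+[A-Z](?: \d{4})?\b")

# Sample lines taken from the question.
lines = [
    "FREE SOFT FOUNDATION V2 1989 PAGE 2",
    "STALLMANWORKS 2000 1977;PAGE 2",
    "THE GNU PAGE 3 1977",
]

titles = []
for line in lines:
    m = title.match(line)
    # Keep None for lines where the pattern does not match at all.
    titles.append(m.group(0) if m else None)

print(titles)
# ['FREE SOFT FOUNDATION', 'STALLMANWORKS 2000', 'THE GNU']
```

The lines are matched one by one here because \s in the character class also matches newlines, so running the pattern over the whole text at once could let a match run across line breaks.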

The fourth bird