
I am given a string containing a comma-separated list of words (where whitespace and case are not significant) and I want a Perl regexp to test the following: the string contains the (complete) word "french" and the (complete) word "english" does not occur earlier. For instance, I want to accept "french", "foobar, french", "bar, french, quux, english", "french, english, french"; but reject "foo, bar", "english, french", "foo, english, bar, french, english".

My goal is to use a regexp of this kind in a lighttpd configuration. To be precise, I want to parse Accept-Language headers, using the naive heuristic that languages are listed in decreasing order of preference, which is often true although not prescribed by the RFC. Hence, I can only use a Perl-compatible regular expression; I cannot use any other features of Perl.

In terms of formal language theory, such a regular expression must exist, but the straightforward solution requires regexp negation, which is painful to perform. (This is why I ask the question with "french" and "english" rather than "fr" and "en", where regexp negation would be tedious but doable by hand.) Are there any Perl-specific regexp features to make it possible to write a concise regexp for my task, or is there a tool to automatically compile a regexp to perform this?

a3nm
  • Yes, you can. It's called look-ahead assertions. They let you express "foo not followed by bar". Conversely, there's look-behind as well: "foo preceded by bar" – Marc B Jan 05 '15 at 18:02
  • This can be done with a regular expression as the other comments/answers show, but why? It would be simpler, more efficient, and easier to code to just iterate over the words in the list (delimited by commas) and compare them to your two target words... – twalberg Jan 05 '15 at 18:32
  • @twalberg The OP already addressed that: "My goal is to use a regexp of this kind in a lighttpd configuration." They aren't writing a full-fledged Perl script. – ThisSuitIsBlackNot Jan 05 '15 at 18:36

1 Answer


Something like this should work:

Update:
Reject only if 'english' occurs before the first 'french':

 # /(?i)^(?:(?!\benglish\b).)*?\bfrench\b/

 (?i)                          # Case insensitive
 ^                             # BOS
 (?:
      (?! \b english \b )
      . 
 )*?
 \b french \b                  # 'french'
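
As a quick sanity check, the updated pattern can be run against the seven sample strings from the question. This is only a sketch: it uses Python's `re` module as a stand-in for the PCRE engine that lighttpd uses, on the assumption that both treat `(?i)`, `\b` and negative look-ahead the same way for this pattern. The names `FRENCH_FIRST`, `prefers_french`, `ACCEPT` and `REJECT` are made up for illustration.

```python
import re

# The updated pattern, copied verbatim from the answer (compact form).
FRENCH_FIRST = re.compile(r"(?i)^(?:(?!\benglish\b).)*?\bfrench\b")


def prefers_french(header: str) -> bool:
    """True if 'french' occurs with no complete word 'english' before it."""
    return FRENCH_FIRST.search(header) is not None


# Sample strings taken from the question.
ACCEPT = [
    "french",
    "foobar, french",
    "bar, french, quux, english",
    "french, english, french",
]
REJECT = [
    "foo, bar",
    "english, french",
    "foo, english, bar, french, english",
]

if __name__ == "__main__":
    for s in ACCEPT + REJECT:
        print(f"{s!r}: {'accept' if prefers_french(s) else 'reject'}")
```

The tempered dot `(?:(?!\benglish\b).)*?` can never step over the start of the word "english", so the match only succeeds when a "french" is reachable from the beginning of the string without crossing one.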

Original:
Reject if 'english' occurs before any 'french' (so "french, english, french" is rejected too):

 # /(?i)^(?!.*\benglish\b.*\bfrench\b).*\bfrench\b/

 (?i)                          # Case insensitive
 ^                             # BOS
 (?!                           # Not 'english' .. 'french'
      .* 
      \b english \b 
      .* 
      \b french \b 
 )
 .* 
 \b french \b                  # Must contain 'french' 
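
To see where the two patterns in this answer disagree, here is a small comparison sketch. It again assumes that Python's `re` module agrees with PCRE on these constructs; `ORIGINAL`, `UPDATED` and `verdicts` are illustrative names, not part of any lighttpd or Perl API.

```python
import re

# Both patterns copied verbatim from the answer (compact forms).
ORIGINAL = re.compile(r"(?i)^(?!.*\benglish\b.*\bfrench\b).*\bfrench\b")
UPDATED = re.compile(r"(?i)^(?:(?!\benglish\b).)*?\bfrench\b")


def verdicts(header: str) -> tuple[bool, bool]:
    """Return (accepted by ORIGINAL, accepted by UPDATED) for one string."""
    return (
        ORIGINAL.search(header) is not None,
        UPDATED.search(header) is not None,
    )


if __name__ == "__main__":
    samples = [
        "french, english, french",     # the case where the two patterns differ
        "english, french",
        "bar, french, quux, english",
    ]
    for s in samples:
        print(f"{s!r}: original={verdicts(s)[0]}, updated={verdicts(s)[1]}")
```

The original pattern's look-ahead scans the whole string for any "english ... french" pair, so a later "french" after "english" triggers a rejection even when an earlier "french" comes first. The updated pattern only looks at what precedes the first reachable "french".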
  • This seems to fail on "french,english,french", which is rejected whereas it should be accepted. I tried to rewrite it using look-behind as `(?i)(?<!\benglish\b.*)\bfrench\b` but this doesn't work: "Variable length lookbehind not implemented in regex". – a3nm Jan 06 '15 at 00:00
  • @a3nm You should be able to replace the negative look-behind with something like `(?:(?!english).)*`. See [regexps: variable-length lookbehind-assertion alternatives](http://stackoverflow.com/a/11640500/176646) for details. – ThisSuitIsBlackNot Jan 06 '15 at 16:06