35

I'm learning an advanced usage of regex and noticed that many posts use (*SKIP) or (*F) in it.

I posted a question where the idea was to match lines that don't have yellow but has blue only if brown exists after blue. And the right answer was:

.*yellow.*(*SKIP)(*F)|^.*\bblue\b(?=.*brown).*$

I also have tried lookaround expressions like below but haven't worked for all the cases:

^((?!yellow).)*blue(?=.*brown).*$

I had no idea about these (*SKIP)(*F) flags, so the question is, how do these flag works? What do they do? And are there other flags like these?

Thanks.

TheLostMind
  • 35,966
  • 12
  • 68
  • 104
Federico Piazza
  • 30,085
  • 15
  • 87
  • 123
  • @SotiriosDelimanolis actually I haven't used it on java pattern I'm testing it on regex101 – Federico Piazza Jul 02 '14 at 15:13
  • @SotiriosDelimanolis yes, you are right but the idea is to implement this regex on java later. First I would like to learn the regex way. Btw, what tag do you recommend me instead of java since I'm doing it on regex101? – Federico Piazza Jul 02 '14 at 15:16
  • @TheLostMind no need to apologies, I don't want to confuse anybody. Thanks for the fix. – Federico Piazza Jul 02 '14 at 15:19
  • 2
    These flags are a feature of Perl Compatible Regular Expressions (PCRE) so I would recommend reading its [documentation](http://www.pcre.org/pcre.txt) (search for the tags you want to know about in the document). In order to use them you'll need to find a regex library that supports them for your language of choice. I do not know of any such library for Java. – SamYonnou Jul 02 '14 at 15:29
  • 1
    @Fede: I don't think `(*SKIP)(*F)` will work on Java. There are other hack ways in Java to get around variable length lookbehind in Java though. – anubhava Jul 02 '14 at 15:33

2 Answers2

67

These two backtracking control verbs are implemented only in Perl, PCRE and the pypi regex module.

The idea of the (*SKIP)(*FAIL) trick is to consume characters that you want to avoid, and that must not be a part of the match result.

A classical pattern that uses of this trick looks like that:

What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match

A regex engine processes a string like that:

  • the first token of the pattern is tested on each character from left to right (by default most of the time, but some regex engines can be set to work from right to left, .net can do this if I remember well)

  • if the first token matches, then the regex engine tests the next token of the pattern with the next characters (after the first token match) etc.

  • when a token fails, the regex engine gets the characters matched by the last token back and tries another way to make the pattern succeed (if it doesn't work too, the regex engine do the same with the previous token etc.)

When the regex engine meets the (*SKIP) verb (in this case all previous tokens have obviously succeeded), it has no right anymore to go back to all the previous tokens on the left and has no right anymore to retry all the matched characters with another branch of the pattern or at the next position in the string until the last matched character (included) if the pattern fails later on the right of the (*SKIP) verb.

The role of (*FAIL) is to force the pattern to fail. Thus all the characters matched on the left of (*SKIP) are skipped and the regex engine continues its job after these characters.

The only possibility for the pattern to succeed in the example pattern is that the first branch fails before (*SKIP) to allow the second branch to be tested.

You can find another kind of explanation here.

About Java and other regex engines that don't have these two features

Backtracking control verbs are not implemented in other regex engines and there are no equivalent.

However, you can use several ways to do the same (to be more clear, to avoid something that can be possibly matched by an other part of the pattern).

The use of capture groups:

way 1:

What_I_want_to_avoid|(What_I_want_to_match)

You only need to extract the capture group 1 (or to test if it exists), since it is what you are looking for. If you use the pattern to perform a replacement, you can use the properties of the match result (offset, length, capture group) to make the replacement with classical string functions. Other language like javascript, ruby... allows to use a callback function as replacement.

way 2:

((?>To_avoid|Other_things_that_can_be_before_what_i_want)*)(What_I_want)

It's the more easy way for the replacement, no need to callback function, the replacement string need only to begin with \1 (or $1)

The use of lookarounds:

example, you want to find a word that is not embedded between two other words (lets say S_word and E_word that are different (see Qtax comment)):

(the edge cases S_word E_word word E_word and S_word word S_word E_word are allowed in this example.)

The backtracking control verb way will be:

S_word not_S_word_or_E_word E_word(*SKIP)(*F)|word

To use this way the regex engine needs to allow variable length lookbehinds to a certain extent. With .net or the new regex module, no problems, lookbehinds can have a totally variable length. It is possible with Java too but the size must be limited (example: (?<=.{1,1000})).

The Java equivalent will be:

word(?:(?!not_S_word_or_E_word E_word)|(?<!S_word not_E_word{0,1000} word))

Note that in some cases, only the lookahead is necessary. Note too that starting a pattern with literal character is more efficient than starting with a lookbehind, that's why I putted it after the word (even if I need to rewrite the word one more time in the assertion.)

Dharman
  • 30,962
  • 25
  • 85
  • 135
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • 2
    Excellent explanation. That's the explanation I was looking for. I'm looking for these behaviour in java so I think I'll to post another question. – Federico Piazza Jul 02 '14 at 16:13
  • 1
    @Fede: I will add a complement about Java that doesn't have these features. – Casimir et Hippolyte Jul 02 '14 at 16:14
  • Thanks. this is very useful it's a pitty that java's engine doesn't support this, it is very handy – Federico Piazza Jul 02 '14 at 16:24
  • @CasimiretHippolyte How do I specify anything to the left of What_I_want_to_avoid that I want to match? For ex., I want to match the word 'blah' but not 'blaah'. How do I incorporate`bl` into `aa(*SKIP)(*F)|ah`? – dasf Apr 14 '15 at 21:48
  • @dasf: I don't understand your question "blah" and "blaah" are mutually exclusive. – Casimir et Hippolyte Apr 14 '15 at 22:10
  • 1
    *"The use of lookarounds"* example is not completely correct, the lookahead expression doesn't match the same strings as the `(*SKIP)(*F)` expression does, nor does it do what the example text states. For example `word` in the string `word E_word` should match, but the lookahead expression doesn't match it. – Qtax Aug 10 '15 at 14:48
  • Another lookaround problem: Say you want to match all `\w+` that are not directly quoted, which is defined by `"\w+"(*SKIP)(?!)|\w+`. Converting it to lookarounds using this approach would give you something like `\w+(?:(?![^"\w]*")|(?<!"[^"\w]{0,1000}\w{0,1000}))` (or just `(\w+)(?:(?!")|(?<!"\1))`). For the string `"foo"bar"baz"` the skip expression would match `bar` as it should (`bar` is between two quoted strings), but the lookaround expression will not match anything at all. – Qtax Aug 11 '15 at 08:25
  • @Qtax: indeed, but this time it is a particular case where S_Word and E_Word are the same, so I will not cover it in this answer, but I will add that S_Word and E_Word must be different. – Casimir et Hippolyte Aug 16 '15 at 12:18
  • @CasimiretHippolyte, thanks for being so descriptive. You also have helped me in the past on a difficult problem! For this one I was wondering if there is a way to mimic this in javascript regex? – antoni Apr 10 '17 at 15:03
  • @antoni: You can't "mimic" `(*SKIP)(*F)` in javascript, however my answer already contains workarounds to deal with regex engines that don't have this feature. – Casimir et Hippolyte Apr 11 '17 at 16:12
  • I know this is not in line with then guidelines for comments, but this is a fantastic explanation! – Mihai Dec 06 '21 at 21:39
7

The (*SKIP) and (*F) (aka *FAIL) patterns are documented in the Perl manual: http://perldoc.perl.org/perlre.html

However, they are only available in Perl and in flavours of regex that imitate Perl (for example the PCRE library used by PHP).

Java's built in regex engine does not support these extensions, and I am not aware of one that does.

My general advice in Java is to keep your regular expressions simple, and use other string manipulation methods to achieve what can't be done clearly with a short regex.

slim
  • 40,215
  • 13
  • 94
  • 127