How to properly use lookahead in Ruby Regex?

Question

I expect this to match only the first instance of <style, because the second one is followed, after the space, by the pattern that I put in the negative lookahead.

"<style type=\"text/html\">ciaoxocs <style />".scan /<style\s?(?!\/>)/
# => ["<style ", "<style"]

I want an explanation of what is happening here, and possibly a better solution that matches only the first instance without matching the closing tag, with or without a space:

<style /> or <style/>

On regex101.com, it works as expected with other languages:

https://www.regex101.com/r/pW2oM3/1

Your problem is the optional space. If you want to make the space optional, you also have to add it to the lookahead. — ndnenkov, Oct 01 '15 at 12:34
@ndn You are right, even if I don't understand why mine is wrong — ciaoben, Oct 01 '15 at 12:49
I just wonder if what you are doing can be easier achieved with [Nokogiri](http://ruby.bastardsbook.com/chapters/html-parsing/). — Wiktor Stribiżew, Oct 01 '15 at 12:50
Obligatory reading: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Max, Oct 01 '15 at 19:38

Casimir et Hippolyte · Answer 1 · 2015-10-01T13:44:24.927

The problem comes from the backtracking mechanism. Here is what happens with the closing tag:

<style\s? matches "<style " but the (?!/>) fails. At that point the backtracking mechanism kicks in, and quantifiers give back their characters one by one until the pattern succeeds. In our case, the only possibility is to give back the space from \s?.
After this backtracking step, <style\s? matches "<style" (without the space this time) and the (?!/>) condition succeeds, because what follows is " />".

There are several ways to prevent this mechanism:

using an atomic group (?>...) (that forbids backtracking into the sub-pattern once the closing parenthesis is reached): <style(?>\s?)(?!/>)
using a possessive quantifier ?+ (that forbids backtracking for the quantifier): <style\s?+(?!/>)
including the space in the lookahead: <style(?!\s?/>)\s?
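As a rough check of the three options above, this sketch runs each of them with scan against the string from the question, using the question's <style tag. The constant names are only illustrative, and the slash is escaped because the patterns are written as Ruby regex literals.

```ruby
INPUT = "<style type=\"text/html\">ciaoxocs <style />"

# Original pattern: \s? can give the space back, so the closing tag still matches.
ORIGINAL   = INPUT.scan(/<style\s?(?!\/>)/)

# Atomic group: once (?>\s?) has taken the space, it cannot be given back.
ATOMIC     = INPUT.scan(/<style(?>\s?)(?!\/>)/)

# Possessive quantifier: \s?+ keeps whatever it matched.
POSSESSIVE = INPUT.scan(/<style\s?+(?!\/>)/)

# Space inside the lookahead: the check covers both "/>" and " />".
LOOKAHEAD  = INPUT.scan(/<style(?!\s?\/>)\s?/)

p ORIGINAL    # ["<style ", "<style"]
p ATOMIC      # ["<style "]
p POSSESSIVE  # ["<style "]
p LOOKAHEAD   # ["<style "]
```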

sawa · Accepted Answer · 2015-10-01T12:53:53.563

Notice that the second match (which comes from <style />) is "<style" (without a trailing space), not "<style " (with a trailing space); the difference is easy to miss in the output. Your negative lookahead (?!\/>) in /<style\s?(?!\/>)/ only prohibits \/> from coming immediately after the substring matched by <style\s?. If that substring is <style (without the space), then what immediately follows it in the original string is the space, not />, so the negative condition is satisfied.
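One way to make the difference visible is to print each match together with the characters that follow it. This is only a small sketch; the variable names are made up for illustration.

```ruby
input = "<style type=\"text/html\">ciaoxocs <style />"

# For every match, record the matched text and the three characters after it.
details = []
input.scan(/<style\s?(?!\/>)/) do
  m = Regexp.last_match
  details << [m[0], m.post_match[0, 3]]
end

details.each do |text, following|
  puts "#{text.inspect} is followed by #{following.inspect}"
end
# "<style " is followed by "typ"
# "<style" is followed by " />"
```

The second match stops before the space, so the lookahead is tested against " />" rather than "/>" and therefore passes.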

If you are sure that the pattern you want to match always has a space, then you can simply make the space obligatory, and you will get only what you want:

"<style type=\"text/html\">ciaoxocs <style />".scan /<style\s(?!\/>)/
# => ["<style "]

If you cannot be sure about that, then move the optional space into the negative lookahead.

"<style type=\"text/html\">ciaoxocs <style />".scan /<style(?!\s?\/>)/
# => ["<style"]
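To see how this last pattern behaves on both forms of the closing tag mentioned in the question, here is a minimal sketch. The sample strings are assumptions chosen for illustration, not taken from the original post.

```ruby
pattern = /<style(?!\s?\/>)/

# Hypothetical inputs: two opening tags and the two closing-tag spellings.
samples = [
  "<style type=\"text/css\">",
  "<style>",
  "<style />",
  "<style/>"
]

results = samples.map { |s| [s, s.scan(pattern)] }.to_h

results.each { |s, found| puts "#{s.inspect} => #{found.inspect}" }
# "<style type=\"text/css\">" => ["<style"]
# "<style>" => ["<style"]
# "<style />" => []
# "<style/>" => []
```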

Why should I hate this? Anyway, good answer. I have understood what I was missing — ciaoben, Oct 01 '15 at 13:02

joanbm · Answer 3 · 2015-10-01T12:53:22.977

You probably want to use String#match instead of String#scan, which applies the pattern repeatedly until the end of the string is reached.

> "<style type=\"text/html\">ciaoxocs <style />".match(/<style\s?(?!\/>)/).to_a
=> ["<style "]
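The following sketch compares the two methods on the question's string. It also suggests that String#match only hides the second hit rather than changing what the pattern accepts: applied to the closing tag alone, the same pattern still matches. The variable names are illustrative.

```ruby
input   = "<style type=\"text/html\">ciaoxocs <style />"
pattern = /<style\s?(?!\/>)/

first_only  = input.match(pattern)[0]        # match stops at the first hit
all_matches = input.scan(pattern)            # scan keeps going to the end
closing_tag = "<style />".match(pattern)[0]  # the pattern alone still matches here

p first_only   # "<style "
p all_matches  # ["<style ", "<style"]
p closing_tag  # "<style"
```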

edited Oct 01 '15 at 12:53

answered Oct 01 '15 at 12:42


It is not what I am asking – ciaoben Oct 01 '15 at 12:49
Really? It exactly matches the required result [linked](https://www.regex101.com/r/pW2oM3/1) in your question. – joanbm Oct 01 '15 at 12:54
If you expected a different result, state *exactly* which part of the original string should be matched. – joanbm Oct 01 '15 at 12:57
@joanbm The OP is not asking about the result. The OP is asking **what is happening** with the OP's code, and a better way to match only the first one (using `scan`). – sawa Oct 01 '15 at 12:59
From the OP: *"In regex101.com, it works as expected with other langs:"*. He compares apples (a single regexp match) with oranges (repeated regexp matching), which is why he gets different results. I think the right answer should mention this misunderstanding of what `#scan` really does. – joanbm Oct 01 '15 at 13:19
