2

Applying this regex pattern:

/(?:(^| |\>|\+))+([a-z\-\_]+)/gi

to this string:

body.test ol+li ol > li #foobar p>span a[href=*]

I get these matches, comma separated:

body, ol,+li, ol, > li, p,>span, a

Why do some matches include the leading space, > or + sign? I'd expect this part of my regex, (?:(^| |\>|\+)), to match those characters but not capture them.

Edit: I am trying to match the HTML tags and CSS selectors that contribute to the CSS specificity of a CSS selector. Thus I want to match each li, span and so forth on its own, without the + or >.

kontur

4 Answers

3

Capturing is not the same as matching. Since you're specifying the combinators in your pattern, they will be picked up by the matcher, regardless of whether they're captured or non-captured.

To capture, you need to exec() your regular expression on the string and loop through the results, each of which contains your capture groups. I've also cleaned up your pattern so that it doesn't capture unnecessarily and so that it recognizes the general sibling combinator ~:

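var sel = "body.test ol+li ol > li #foobar p>span a[href=*]";
// The combinators sit in a non-capturing group; only the name is captured
var re = /(?:^| |>|\+|~)+([a-z_-]+)/gi;
var matches = [], m;

// With the g flag, each exec() call resumes from re.lastIndex
while ((m = re.exec(sel)) !== null) {
    matches.push(m[1]); // m[0] is the whole match, m[1] the capture group
}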

You will then obtain the expected matches:

body, ol, li, ol, li, p, span, a
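
The difference between the whole match and the capture group can be seen by inspecting each exec() result. The following is a minimal sketch using the same pattern and string as above; the names fullMatches, captures and viaMatchAll are only illustrative, and the last part assumes an engine that supports String.prototype.matchAll (ES2020).

```javascript
const sel = "body.test ol+li ol > li #foobar p>span a[href=*]";
const re = /(?:^| |>|\+|~)+([a-z_-]+)/gi;

const fullMatches = []; // m[0]: everything the pattern consumed
const captures = [];    // m[1]: only what the capturing group holds
let m;
while ((m = re.exec(sel)) !== null) {
    fullMatches.push(m[0]);
    captures.push(m[1]);
}

console.log(fullMatches); // combinators and spaces are still part of each match
console.log(captures);    // element names only

// Same captures without the explicit loop (assumes matchAll is available)
const viaMatchAll = Array.from(sel.matchAll(re), (x) => x[1]);
console.log(viaMatchAll);
```

A non-capturing group only keeps its content out of m[1], m[2] and so on. It does not remove that content from m[0], which is what match() returns when the g flag is set.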
BoltClock
  • Perfect explanation and answer. Debugging the content of `m` in the while loop helped me understand what is going on. I am using it in [a small javascript module that calculates css specifity](https://github.com/johannesneumeier/css-specifity/blob/master/specifity.js). – kontur Feb 09 '13 at 21:29
  • @kontur: You may want to rename it - the correct spelling is "specificity" (that's "specific" + "ity") :) – BoltClock Feb 10 '13 at 03:33
2

The inner parentheses in (?:(^| |\>|\+)) create a capturing group. You can make that group non-capturing too, and I think you should move the + quantifier inside the outer group:

/(?:(?:^| |\>|\+)+)([a-z\-\_]+)/gi

You can also use a character class to avoid the pipes, in which case you no longer need to escape > and +. But remember not to put the caret (^) at the beginning of the character class, or it will negate the whole class; anywhere else in the class it is just a literal ^:

/(?:[ >+^]+)([a-z_-]+)/gi

You don't need to escape - and _ in a character class. Just put the - at the end, and all is fine.
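
Because a ^ inside a character class is a literal caret rather than the start-of-string anchor, the two placements behave differently on the string from the question. The following is a rough sketch of that comparison; captureAll is a hypothetical helper written for this illustration, and the second pattern uses the (?:^|[ >+]) form mentioned in the comments below.

```javascript
const sel = "body.test ol+li ol > li #foobar p>span a[href=*]";

// Hypothetical helper: collect capture group 1 from every match of a global regex
function captureAll(re, str) {
    const out = [];
    let m;
    while ((m = re.exec(str)) !== null) {
        out.push(m[1]);
    }
    return out;
}

// ^ inside the class: a literal caret, so nothing anchors at the start
const caretInClass = captureAll(/(?:[ >+^]+)([a-z_-]+)/gi, sel);

// ^ outside the class: an anchor, so the leading "body" is found
const caretOutside = captureAll(/(?:^|[ >+])([a-z_-]+)/gi, sel);

console.log(caretInClass); // no "body" at the front
console.log(caretOutside); // starts with "body"
```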

Rohit Jain
  • Hm, weird, when trying to figure out what's wrong I tried that on this online regex tester: http://regex.larsolavtorvik.com/ and it seems to be faulty - it still shows the matches with the `+` etc. Trying the same on http://rubular.com/ your suggestion works. Problem solved, should avoid the first regexp tool then. Also thanks for your suggestion about the character class! – kontur Feb 09 '13 at 11:35
  • `(?:^|[ >+])` this works, seems like the `^` inside the class would indicate the negation, rather than the line beginning. – kontur Feb 09 '13 at 11:36
  • @kontur.. No, when you use `^` at the beginning, only then it denotes negation. Else it is just the `^` only. Anyways, I have updated the first regex, you needed to use nested non-capturing group there since you were using quantifier. – Rohit Jain Feb 09 '13 at 11:37
  • great insights here, thanks! Could there be some difference in regex handling here? The online javascript regex tool as well as my firefox seem to still match those `+` signs, yet the ruby regex online tool does not match them. Or am I missing something here? Also, using your regexp `(?:[ >+^]+)([a-z_-]+)` at http://regex.larsolavtorvik.com/ does not match the starting `body` tag, which is why I thought the `^` needs to be outside the group. – kontur Feb 09 '13 at 11:52
  • regex.larsolavtorvik.com on the javascript tab, as well as Firefox. – kontur Feb 09 '13 at 11:54
  • @kontur.. What is your exact requirement? – Rohit Jain Feb 09 '13 at 11:57
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/24231/discussion-between-kontur-and-rohit-jain) – kontur Feb 09 '13 at 11:59
  • @kontur.. Try out this: - `(((?![ >+^]).)+)` – Rohit Jain Feb 09 '13 at 12:00
0

You have a capturing group here: (^| |\>|\+).

Mikhail Vladimirov
0

You have two capturing groups, (^| |\>|\+) and ([a-z\-\_]+) - the first one directly inside of a non-capturing group. Just remove it:

/(?:^| |>|\+)+([a-z_-]+)/gi

For how to get the captured groups while matching repeatedly (globally), see JavaScript regular expressions and sub-matches. By the way, you could also try .split(/[ >+]+/) or .match(/[^ >+]+/g).
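
As a quick sketch of those two alternatives, applied to the string from the question (the variable names viaSplit and viaMatch are only illustrative):

```javascript
const sel = "body.test ol+li ol > li #foobar p>span a[href=*]";

// Split on runs of combinator characters (space, >, +)
const viaSplit = sel.split(/[ >+]+/);

// Or match runs of everything that is not a combinator character
const viaMatch = sel.match(/[^ >+]+/g);

console.log(viaSplit);
console.log(viaMatch);
```

Both calls yield the same list of compound selectors. Note that tokens such as body.test, #foobar and a[href=*] still carry their class, id and attribute parts, so they would need further processing before counting specificity.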

Community
Bergi
  • @RohitJain has pointed this out already. Somehow though I still get matches that include the `+` and `>` signs, as well as the spaces. – kontur Feb 09 '13 at 12:12
  • The whole match of course does include all the characters. The first (and only) capturing group will not. How do you apply the regex? – Bergi Feb 09 '13 at 12:15
  • `"body, ol,+li, ol, > li, p,>span, a".match(/(?:(^| |\>|\+))+([a-z\-\_]+)/gi);` – kontur Feb 09 '13 at 12:23
  • @kontur: That's matching, not capturing. Matching always picks up a subpattern whether or not you're capturing it. – BoltClock Feb 09 '13 at 13:08
  • @BoltClock yes, I seemed to miss that subtle difference - @RohitJain explained it to me in chat. It still seems very weird that `"Lorem ipsum dol solor".match(/(?: )/gi)` would *match* three empty spaces. Really a bit puzzling. How can I access *captured* as opposed to *matched* groups in javascript? – kontur Feb 09 '13 at 13:11
  • @kontur: I've posted an answer. (Actually two, but one had the completely unrealistic assumption that you had to use matching instead of trying to capture - that's already deleted.) – BoltClock Feb 09 '13 at 13:34