JavaScript Regular Expression - strange results

Question

I have a sample string below:

var nodeTest = "<node1>xxxxxxx<x> xxx</node1>xxx xx</x>x x x xxxxxx <node2>xx x x xxxxxxxxxx</node2> xxxxx";

I am trying to match all nodes, with and without white space: anything (numbers, white space, text, any character at all) between < and >, including the < and > themselves. I have tried many patterns, but none of them behave the way I expect. My most recent attempt is this:

var nodePairs = nodeTest.match(/<(.*)>/gi);

But it matches the ENTIRE test string. Can anyone offer any clues as to where I might be going wrong? Thanks!

I am not sure what your main intentions are but in case you have difficulty with really complex xmls/htmls then I would recommend that you use jquery to parse them example - $("xxxxxxx xxxxxx xxx x x xxxxxx xx x x xxxxxxxxxx xxxxx") — tusharmath, Jun 20 '13 at 17:52
I will give jquery a go later thanks -my objective was to pick all nodes out of a large body of text. — user1360809, Jun 20 '13 at 18:01
possible duplicate of [Greedy vs. Reluctant vs. Possessive Quantifiers](http://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-quantifiers) — HamZa, Jan 27 '14 at 22:32

score 3 · Accepted Answer · answered Jun 20 '13 at 17:44

.* is greedy - meaning it'll match as much as possible (in this case, from the first < to the last >).

If you want lazy search, use .*? to match as little as possible.
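A minimal sketch of the difference, run against the sample string from the question (the variable names `greedy` and `lazy` are only for illustration):

```javascript
const nodeTest = "<node1>xxxxxxx<x> xxx</node1>xxx xx</x>x x x xxxxxx <node2>xx x x xxxxxxxxxx</node2> xxxxx";

// Greedy: .* runs to the end of the string, then backtracks to the LAST ">".
const greedy = nodeTest.match(/<(.*)>/g);
console.log(greedy.length); // 1

// Lazy: .*? stops at the FIRST ">" that follows each "<".
const lazy = nodeTest.match(/<(.*?)>/g);
console.log(lazy);
// [ '<node1>', '<x>', '</node1>', '</x>', '<node2>', '</node2>' ]
```

With the `g` flag, `match` returns only the full matches, so the brackets are included in each result.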

h2ooooooo

this cured it thanks! Another lesson learned! – user1360809 Jun 20 '13 at 17:59

score 1 · Answer 2 · edited Jun 20 '13 at 17:48

In .*, the * quantifier is greedy: it makes the . match as many characters as it can, and that explains your results.

What you probably want is this regex:

<node(\d)>(.*?)<\/node\1>

The result you want is in the second captured group. See how it works here.

The \1 by the way refers to the first captured group.

If your node numbers can go beyond a single digit (that is, past node0 to node9), then you'd prefer:

<node(\d+)>(.*?)<\/node\1>
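As a rough sketch, here is how that pattern could be applied to the sample string from the question. The names `nodePattern` and `nodes` are made up for the example, and `matchAll` assumes a reasonably modern JavaScript engine:

```javascript
const nodeTest = "<node1>xxxxxxx<x> xxx</node1>xxx xx</x>x x x xxxxxx <node2>xx x x xxxxxxxxxx</node2> xxxxx";

// Group 1 is the node number, group 2 is the content.
// \1 requires the closing tag to carry the same number as the opening tag.
const nodePattern = /<node(\d+)>(.*?)<\/node\1>/g;

const nodes = Array.from(nodeTest.matchAll(nodePattern), (m) => ({
  number: m[1],
  content: m[2],
}));

console.log(nodes);
// [
//   { number: '1', content: 'xxxxxxx<x> xxx' },
//   { number: '2', content: 'xx x x xxxxxxxxxx' }
// ]
```

Note that `.` does not match line breaks by default, so content spanning several lines would need the `s` flag.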

edited Jun 20 '13 at 17:48

h2ooooooo

answered Jun 20 '13 at 17:46

Jerry

May I suggest `\d+`? :-) – h2ooooooo Jun 20 '13 at 17:47
@h2ooooooo Sure, I'll add that as well. – Jerry Jun 20 '13 at 17:47
Thanks for the code example! I don't see how `\d+` would work as isn't that digits only? – user1360809 Jun 20 '13 at 18:00
@user1360809 Don't you have "node" and a number attached to it? Also, simply changing the greedy `.*` to `.*?` won't get you the desired results since you have `>` characters inside the nodes. – Jerry Jun 20 '13 at 18:02
`node\d` means "*the string node followed by a single digit*" whereas `node\d+` means "*the string node followed by a digit repeated 1 or more times*" (essentially any number except for numbers with decimals) – h2ooooooo Jun 20 '13 at 18:02
ah, good point! the node and number is only a test, but isn't it invalid html/xml to have nested <> inside <>? you would have to use a different representation, no? – user1360809 Jun 20 '13 at 18:09
Ahh, since it's for html/xml. Yea, you'd have the `<>` symbols as `&lt;` and `&gt;` instead. – Jerry Jun 20 '13 at 18:44

score 0 · Answer 3 · answered Jun 20 '13 at 18:12

A good way is to use a negated character class, which avoids the greedy/lazy problem altogether:

/<([^>]+)>/
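A small sketch of this pattern on the question's sample string, with the `g` flag added so that every tag is found (the variable names are illustrative only). It also shows, as an aside, what the `/<[^]>/` variant from the comment does in JavaScript:

```javascript
const nodeTest = "<node1>xxxxxxx<x> xxx</node1>xxx xx</x>x x x xxxxxx <node2>xx x x xxxxxxxxxx</node2> xxxxx";

// [^>]+ can never run past a ">", so no lazy quantifier is needed.
const tagNames = Array.from(nodeTest.matchAll(/<([^>]+)>/g), (m) => m[1]);
console.log(tagNames);
// [ 'node1', 'x', '/node1', '/x', 'node2', '/node2' ]

// In JavaScript, [^] is an empty negated class: it matches any ONE character.
// So /<[^]>/ only matches "<", a single character, then ">".
const oneChar = nodeTest.match(/<[^]>/g);
console.log(oneChar); // [ '<x>' ]
```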

Casimir et Hippolyte

I tried something similar to this but it was more like `/<[^]>/` – user1360809 Jun 20 '13 at 18:24
