Javascript match regex - prevent greediness

Question

I am having an issue with my JavaScript match() regex.

<div class="a">       whitespace, new lines, and  content    </div>
<div class="junk">    junkjunkjunk                           </div>
<div class="a">       whitespace, new lines, and  content    </div>
<div class="junk">    junkjunkjunk                           </div>
<div class="a">       whitespace, new lines, and  content    </div>

Let's say I want to capture everything in between <div class="a"> and the closest </div>. The following regex is capturing everything, I'm assuming due to greediness:

/<div class="a">[\s\S]+<\/div>?/ig

I want to capture each <div class="a">...</div> individually such that I can output each as capture[0], capture[1], etc. How would I do this?

Thank you.

EDIT: Updated to better reflect my problem. Assume there is undesired markup and text between desired divs.

HTML cannot be parsed with regex. You can use it in very specific situations, and this is not one of them. — Rodrigo, Jun 26 '11 at 21:35

Rodrigo · Accepted Answer · 2011-06-26T19:51:41.873

2

First, parsing HTML with regex is baaad... seriously man, you can use the innerHTML property of each div to change its content, or better, use jQuery or another JavaScript framework to do this kind of job.

This can be done with jQuery like this:

$("div.a").each(
  function() {
    alert($(this).html())
  }
);

Second, if you really want to use regex, and assuming there is only text (no markup) inside each div, you can use something like this:

/<div class="a">([^<]+)<\/div>/ig
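As a rough sketch of how the negated character class behaves (the sample strings and variable names here are made up for illustration): with the + inside the capture group, the group holds the whole text of the div, and the pattern stops matching as soon as the div contains any nested tag.

```javascript
// Negated class: match any run of characters that are not '<'.
const textOnlyRe = /<div class="a">([^<]+)<\/div>/ig;

// Hypothetical inputs, not taken from the question.
const plain = '<div class="a">text only</div><div class="a">more text</div>';
const nested = '<div class="a">text <b>bold</b></div>';

// Only text inside each div: both divs are matched separately.
console.log(plain.match(textOnlyRe).length); // 2

// Nested markup inside the div: [^<]+ stops at <b>, so nothing matches.
console.log(nested.match(textOnlyRe)); // null
```

So this only fits the "text only" case stated above; anything with child elements needs a different approach.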

edited Jun 26 '11 at 19:51

answered Jun 26 '11 at 19:39

Rodrigo

Is it bad because it takes longer? Are there other concerns? – John Smith Jun 26 '11 at 19:48
+1, @John Smith, Please see the answer to [this SO post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for a very informative explanation of why "parsing HTML with regex is baaad" – smartcaveman Jun 26 '11 at 19:56
It's bad because regular expressions work with regular languages, and HTML is not regular. For further information, see http://en.wikipedia.org/wiki/Regular_language, http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html. It has also been discussed many times on SO: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not, http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Rodrigo Jun 26 '11 at 19:56
@John it is like hammering a screw. Also, if you want to change the content, it's much easier to use a framework like jQuery. – dr jerry Jun 26 '11 at 20:00

score 2 · Answer 2 · answered Jun 26 '11 at 20:25

To give a straight regex answer:

To remove the greediness of the quantifiers, put a ? after the quantifier like this:

/<div class="a">[\s\S]+?<\/div>?/ig

This forces the + to match as little as possible. It also works with *.
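To see the difference, here is a small sketch that can be run as-is. The input is a shortened, made-up version of the markup from the question, and the stray ? after the final > is left out, since it only makes that > optional.

```javascript
// Shortened sample modeled on the question: wanted divs with junk between.
const sampleHtml = [
  '<div class="a">  first  </div>',
  '<div class="junk">  junkjunkjunk  </div>',
  '<div class="a">  second  </div>',
].join('\n');

// Greedy: [\s\S]+ runs to the end of the input and backtracks only as far
// as the last </div>, so everything comes back as a single match.
const greedy = sampleHtml.match(/<div class="a">[\s\S]+<\/div>/ig);

// Lazy: [\s\S]+? stops at the first </div> it can reach.
const lazy = sampleHtml.match(/<div class="a">[\s\S]+?<\/div>/ig);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
console.log(lazy[0]);       // <div class="a">  first  </div>
```

With the g flag, match() returns an array of the full matches, so lazy[0], lazy[1], etc. are the individual divs.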

stema

score 1 · Answer 3 · answered Jun 26 '11 at 19:38

Then you need to put the question mark after the + quantifier, before the closing </div>, and use () around what you want to capture.
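Put together, that might look like the sketch below. The input string and the names divRe, captures and found are made up for illustration. Note that match() with the g flag returns only the full matches, so an exec() loop is used here to get at the group.

```javascript
// Made-up sample: two wanted divs with an unwanted one between them.
const pageHtml =
  '<div class="a">  first  </div>\n' +
  '<div class="junk">  junk  </div>\n' +
  '<div class="a">  second  </div>';

// Lazy quantifier inside a capture group: group 1 is the div's content.
const divRe = /<div class="a">([\s\S]+?)<\/div>/ig;

const captures = [];
let found;
// With the g flag, each exec() call resumes after the previous match.
while ((found = divRe.exec(pageHtml)) !== null) {
  captures.push(found[1]);
}

console.log(captures.length); // 2
console.log(captures[0]);     // "  first  "
```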

rkulla

This sort of looks like it works...but why does it match [only every other line](http://bit.ly/iN8rTq)? – Jun 26 '11 at 21:23

user113716 · Answer 4 · 2011-06-26T19:49:39.373

One way to prevent regex greediness is to not use regex.

If you'll allow for an alternate solution, this assumes your HTML is in string form, and not part of the DOM:

var str = '<div class="a">       whitespace, new lines, and  content    </div>\
<div class="a">       whitespace, new lines, and  content    </div>\
<div class="a">       whitespace, new lines, and  content    </div>';

var temp = document.createElement('div');
temp.innerHTML = str;

var capture = [];

for( var i = 0; i < temp.childNodes.length; i++ ) {
    var node = temp.childNodes[i];
    if( node && node.nodeType === 1 && node.className === 'a' ) {
        capture.push( node.innerHTML );
    }
}

alert(capture[0]);

With respect to a regex, here's one approach using .replace():

var str = '<div class="a">       whitespace, new lines, and  content    </div>\
<div class="a">       whitespace, new lines, and  content    </div>\
<div class="a">       whitespace, new lines, and  content    </div>';

var res = [];

str.replace(/<div class="a">([^<]+)<\/div>/ig,function(s,g1) {
    res.push(g1);
});
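On a newer runtime, the same collection step could probably be written with matchAll() instead of calling .replace() for its side effect. This is only a sketch, assuming an ES2020+ environment; the sample string and names are made up, and like the pattern above it only works when the divs contain plain text.

```javascript
// Made-up sample with an unwanted div between the wanted ones.
const markup =
  '<div class="a">first</div>\n' +
  '<div class="junk">junk</div>\n' +
  '<div class="a">second</div>';

// matchAll() requires the g flag and yields one match object per hit;
// index 1 of each match object is the first capture group.
const contents = Array.from(
  markup.matchAll(/<div class="a">([^<]+)<\/div>/ig),
  (m) => m[1]
);

console.log(contents); // [ 'first', 'second' ]
```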
