Javascript [\s\S]* is too greedy

Question

I have a sample multi-line string from which I need to get every div tag, along with its contents, whose p tag does not have a specific id.

var str = `<div>
         <p id="a">Sample sentence</p>
         </div>

         <div>
         <p id="b">Sample sentence 2</p>
         </div>`;

The regex I am using is too greedy: I only need to match the second div tag and its contents, but it also captures the div tag above it. Here is my regex:

<div>[\s\S]*<p id="b">[\s\S]*<\/div>

This regex captures the entire string, but I only want to capture:

<div>
  <p id="b">Sample sentence 2</p>
</div>

Can any regex guru out there help me out with this?

Use a DOM parser; this is trivial if you are in a browser or Node.js, e.g. http://stackoverflow.com/questions/10585029/parse-a-html-string-with-js and many other examples here. — Alex K., May 08 '17 at 18:02
Regexp gurus would advise you not to try to parse/analyze/manipulate DOM with regexp. For instance, it is theoretically impossible to write a regexp which would behave properly in the presence of nested divs. — , May 08 '17 at 18:04
Tags are parsable with regular expressions. However, open/close pairing (or the lack of it) and structural relationships between tags are not the forte of regex. — , May 08 '17 at 18:09
You can use `[\s\S]*?` to make the quantifier lazy, but this is not a general purpose solution. Use an HTML/XML parser for reliable results. — 4castle, May 08 '17 at 18:10
4castle is correct - it's not that your regex is too greedy, it's that your regex needs to be lazy. — aaaaaa, May 08 '17 at 18:13
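
To illustrate the comments above, here is a minimal sketch comparing the greedy pattern from the question with its lazy form. It assumes the sample string from the question with plain double quotes. The names `html`, `greedy`, `lazy`, and `tempered` are illustrative, and the third pattern (a negative lookahead that refuses to cross another `<div>`) is an addition that does not come from the thread. Note that the lazy quantifier alone still returns the whole string here, because a regex always takes the leftmost possible start, which is the first `<div>`.

```javascript
const html = `<div>
  <p id="a">Sample sentence</p>
</div>

<div>
  <p id="b">Sample sentence 2</p>
</div>`;

// Pattern from the question: both quantifiers are greedy.
const greedy = /<div>[\s\S]*<p id="b">[\s\S]*<\/div>/;

// Lazy quantifiers, as suggested in the comments.
const lazy = /<div>[\s\S]*?<p id="b">[\s\S]*?<\/div>/;

// Assumption, not from the thread: stop the first gap from crossing another <div>.
const tempered = /<div>(?:(?!<div>)[\s\S])*?<p id="b">[\s\S]*?<\/div>/;

const greedyMatch = html.match(greedy)[0];
const lazyMatch = html.match(lazy)[0];
const temperedMatch = html.match(tempered)[0];

console.log(greedyMatch === html); // true: the whole string is matched
console.log(lazyMatch === html);   // true: the match still starts at the first <div>
console.log(temperedMatch);        // only the second div block
```

Even the third pattern only holds for flat, attribute-free `<div>` tags like the ones in the sample; nested divs would still break it, which is why the comments recommend a parser.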

score 1 · Answer 1 · edited May 23 '17 at 12:10

As many will advise: don't use regular expressions to interpret/parse/extract HTML. Instead use the capabilities of the DOM. For example:

var str=`
<div>
  <p id="a">Sample sentence</p>
</div>

<div>
  <p id="b">Sample sentence 2</p>
</div>`;

var elem = document.createElement('span');
elem.innerHTML = str;
elem = elem.querySelector('div:nth-child(2)');
console.log(elem.outerHTML);

answered May 08 '17 at 18:12

trincot

score 1 · Answer 2 · answered May 08 '17 at 18:14

You can try /<div>\n.*<p id="b">.*\n.*<\/div>/g if you have to use RegExp in this case. However, I would suggest using a DOM parser if you can.

// `.` does not match line breaks, so each `.*` stays on one line and
// the match cannot spill over from the first div into the second.
const regex = /<div>\n.*<p id="b">.*\n.*<\/div>/g;
const str = `<div>
         <p id="a">Sample sentence</p>
         </div>

         <div>
   <p id="b">Sample sentence 2</p>
         </div>`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
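
The question asks for every div whose p tag does not have a given id, rather than one specific div. A rough sketch of that, if a regex really must be used, is to split the string into div blocks with a lazy quantifier and then filter them. The helper name `divsWithoutParagraphId` is hypothetical, and the sketch assumes flat, attribute-free `<div>` tags and ids written as `id="..."` with double quotes; it would not cope with nested divs.

```javascript
const html = `<div>
  <p id="a">Sample sentence</p>
</div>

<div>
  <p id="b">Sample sentence 2</p>
</div>`;

// Hypothetical helper: returns the div blocks that do not contain <p id="excludedId">.
function divsWithoutParagraphId(source, excludedId) {
  // The lazy quantifier ends each block at the nearest closing </div>.
  const blocks = source.match(/<div>[\s\S]*?<\/div>/g) || [];
  const marker = `<p id="${excludedId}">`;
  return blocks.filter((block) => !block.includes(marker));
}

const kept = divsWithoutParagraphId(html, "a");
console.log(kept.length); // 1
console.log(kept[0]);     // the div block containing <p id="b">
```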
