1

I have a sample multi-line string from which I need to get every div tag, together with its contents, whose p tag does not have a specific id.

var str = `<div>
         <p id="a">Sample sentence</p>
         </div>

         <div>
         <p id="b">Sample sentence 2</p>
         </div>`;

The regex I was using is too greedy: I only need to match the second div tag and its contents, but it also captures the div tag above it. Here is my regex:

<div>[\s\S]*<p id="b">[\s\S]*<\/div>

This regex captures the entire string, but I only want to capture:

<div>
  <p id="b">Sample sentence 2</p>
</div>

Can any regex guru out there help me out with this?

Xavia
  • 3
    Use a DOM Parser, this is trivial if you are within a browser/node.js, E.g. http://stackoverflow.com/questions/10585029/parse-a-html-string-with-js & many other examples here. – Alex K. May 08 '17 at 18:02
  • 3
    Regexp gurus would advise you not to try to parse/analyze/manipulate DOM with regexp. For instance, it is theoretically impossible to write a regexp which would behave properly in the presence of nested divs. –  May 08 '17 at 18:04
  • Tags are parsable with regular expressions. However, open/close pairing (or the lack of it) and structural relationships between tags are not the forte of regex. –  May 08 '17 at 18:09
  • 1
    You can use `[\s\S]*?` to make the quantifier lazy, but this is not a general purpose solution. Use an HTML/XML parser for reliable results. – 4castle May 08 '17 at 18:10
  • 4castle is correct - it's not that your regex is too greedy, it's that your regex needs to be lazy. – aaaaaa May 08 '17 at 18:13
  • Maybe this [link](https://regex101.com/r/awFFyg/1) can help – Nebojsa Nebojsa May 08 '17 at 18:15
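
To make the lazy-quantifier suggestion above concrete, here is a minimal sketch run against the sample string from the question. It suggests that switching to `[\s\S]*?` alone is not enough here, because the match still begins at the first `<div>`. The `tempered` pattern is my own assumption of one possible workaround (it is not taken from the linked regex101 page), and it only holds for non-nested divs.

```javascript
// The sample HTML from the question, written as a template literal.
const html = `<div>
  <p id="a">Sample sentence</p>
</div>

<div>
  <p id="b">Sample sentence 2</p>
</div>`;

// Lazy quantifiers alone: the match still starts at the first <div>,
// because [\s\S]*? may expand across </div> until it finds <p id="b">.
const lazy = /<div>[\s\S]*?<p id="b">[\s\S]*?<\/div>/;

// Tempered version (assumption, not from the thread): the lookahead
// (?!<\/div>) stops the first gap from crossing a closing </div>,
// so the match has to start at the second <div>.
const tempered = /<div>(?:(?!<\/div>)[\s\S])*?<p id="b">[\s\S]*?<\/div>/;

const lazyMatch = html.match(lazy)[0];
const temperedMatch = html.match(tempered)[0];

console.log(lazyMatch === html); // true: both divs were captured
console.log(temperedMatch);      // only the div containing <p id="b">
```

As the other comments point out, this still breaks down once divs are nested or carry attributes, which is why a parser remains the more reliable choice.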

2 Answers

1

As many will advise: don't use regular expressions to interpret/parse/extract HTML. Instead use the capabilities of the DOM. For example:

var str=`
<div>
  <p id="a">Sample sentence</p>
</div>

<div>
  <p id="b">Sample sentence 2</p>
</div>`;

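// Let the browser parse the string by assigning it to a detached element.
var elem = document.createElement('span');
elem.innerHTML = str;
// Select the second <div> child and print its markup.
elem = elem.querySelector('div:nth-child(2)');
console.log(elem.outerHTML);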
trincot
1

You can try /<div>\n.*<p id="b">.*\n.*<\/div>/g if you have to use a RegExp in this case. It relies on the exact line layout of the sample, so I would suggest using a DOM parser if you can.

const regex = /<div>\n.*<p id="b">.*\n.*<\/div>/g;
const str = `<div>
         <p id="a">Sample sentence</p>
         </div>

         <div>
   <p id="b">Sample sentence 2</p>
         </div>`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
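
If the goal is the one stated in the question (every div whose p tag does not have a given id), a different approach is to match each div block first and filter afterwards. The sketch below is only an illustration under the assumption that the divs are not nested and carry no attributes; `divsWithoutParagraphId` is a hypothetical helper name, not something from the question or from any library.

```javascript
// Hypothetical helper: returns every non-nested <div>...</div> block
// that does NOT contain a <p> tag with the given id.
function divsWithoutParagraphId(html, excludedId) {
  // Lazy match up to the first closing tag; assumes divs are not nested.
  const blocks = Array.from(html.matchAll(/<div>[\s\S]*?<\/div>/g), (m) => m[0]);
  // Keep only the blocks that lack the excluded paragraph id.
  return blocks.filter((block) => !block.includes(`<p id="${excludedId}">`));
}

const sample = `<div>
  <p id="a">Sample sentence</p>
</div>

<div>
  <p id="b">Sample sentence 2</p>
</div>`;

const kept = divsWithoutParagraphId(sample, "a");
console.log(kept.length); // 1
console.log(kept[0]);     // the div that contains <p id="b">
```

Splitting first and filtering second keeps each regex simple, but the same caveat applies: for real-world HTML a parser is safer.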
Piyush