
Currently I have found a regular expression that finds any <tag></tag> and its contents.

<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

If I write irrelevant <tag>content</tag> even more irrelevant, I get what I want, which is the exact tag with its content: <tag>content</tag>.

The issue arises when I try to use this on a nested tag which nests with itself, like:

<tag>gimme cookies<tag>gimme more cookies</tag></tag>

Unfortunately, this time I get:

<tag>gimme cookies<tag>gimme more cookies</tag>

Without the second closing tag.
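A minimal reproduction of the behaviour described above, assuming JavaScript (the question is tagged for it in the answers) and the `i` flag so that `[A-Z]` matches a lowercase tag name. The variable names are illustrative only. The lazy `(.*?)` stops at the first closing tag it can reach, which is why the outer `</tag>` is left out.

```javascript
// The question's pattern, written as a JavaScript literal (slash escaped, `i` flag assumed).
const pattern = /<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/i;

const nested = "<tag>gimme cookies<tag>gimme more cookies</tag></tag>";
const result = nested.match(pattern);

// The match ends at the first </tag>, so the final closing tag is missing.
console.log(result[0]);
```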

How could I improve the regex so that it matches a start tag, its matching end tag, and everything between them, so I could nest to infinity and beyond?

  • please, please, please don't parse HTML with regex. HTML is not a regular language – aelor Sep 22 '14 at 05:58
  • Have a look at http://stackoverflow.com/questions/10585029/parse-a-html-string-with-js for suggestions on how to parse a HTML string in JavaScript correctly. – Adam Sep 22 '14 at 05:59
  • Did you make an effort to read the [other 7000 questions on this topic](http://stackoverflow.com/search?q=parse+html+with+regex)? – Marty Sep 22 '14 at 06:06
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Marty Sep 22 '14 at 06:07

2 Answers


I'd recommend the approach taken in "Parse a HTML String with JS" rather than sinking time into a complex regex; it is more robust. It reuses the browser's parsing functionality without adding content to your page.

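// Parse the markup in a detached element so nothing is added to the page.
var el = document.createElement( 'div' );
el.innerHTML = "<tag>gimme cookies<tag>gimme more cookies</tag></tag>";

// Collect every <tag> element, nested ones included, in document order.
var tags = el.getElementsByTagName( 'tag' );
var i;
for (i = 0; i < tags.length; i++) {
    console.log(tags[i].innerHTML);
}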

If you're using jQuery or a modern browser you can filter out exactly what you want with $() or querySelector.

Adam

Whoa, talk about opening a can of worms. HTML is so irregular that you could go mad trying to handle that with regular expressions.

Let's disregard the possibility that there might be substrings that look like tags but aren't (e.g., in comments or strings). You'd still need a regex engine capable of handling recursion, and JavaScript's isn't one of them.

What you can do is make (reasonably, for very loose definitions of reasonable) sure that you only match innermost tags by using

/<([A-Z][A-Z0-9]*)\b[^>]*>(?:(?!<\/?\1)[\s\S])*<\/\1>/ig

and then keep matching/replacing until no more matches are left. This of course still requires that all tags are properly nested (and that all opening tags are closed and vice versa, something you hardly ever see in real life).
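The "keep matching until nothing is left" loop could be sketched roughly as follows. This is only an illustration under stated assumptions: `collectElements` and the placeholder scheme (a NUL character around an index) are hypothetical, the input is assumed to contain no NUL characters, and the pattern used here anchors the lookahead on `<` so that content which merely contains the tag name (as in `<b>bold</b>`) still matches.

```javascript
// Innermost-element pattern; the lookahead only rejects "<name" and "</name".
const INNERMOST = /<([A-Z][A-Z0-9]*)\b[^>]*>(?:(?!<\/?\1)[\s\S])*<\/\1>/ig;

// Hypothetical helper: collect every element, innermost first.
function collectElements(input) {
  const found = [];
  // Swap placeholders back so each stored element shows its full original text.
  const expand = (s) => s.replace(/\u0000(\d+)\u0000/g, (_, n) => found[Number(n)]);
  let text = input;
  let previous;
  do {
    previous = text;
    // Replace each innermost element with a placeholder so its parent
    // becomes innermost on the next pass.
    text = text.replace(INNERMOST, (match) => {
      found.push(expand(match));
      return "\u0000" + (found.length - 1) + "\u0000";
    });
  } while (text !== previous);
  return found;
}

const elements = collectElements("<tag>gimme cookies<tag>gimme more cookies</tag></tag>");
console.log(elements);
```

With the nested example from the question, the inner element is collected on the first pass and the outer one on the second. Unbalanced markup simply stops matching, so the caveat about properly nested tags still applies.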

Test it live on regex101.com.
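For comparison, the recursion that JavaScript regexes lack can be replaced by an explicit depth counter outside the regex. The sketch below is not from the original answer: `findBalanced` is a hypothetical helper, it handles a single tag name that is assumed to be plain alphanumeric, and it shares the caveats above (tag-like text in comments or strings is not recognised, and self-closing tags are not handled).

```javascript
// Hypothetical helper: return the outermost balanced <name>...</name> spans.
function findBalanced(input, name) {
  // Matches both opening and closing tags; group 1 is "/" for a closing tag.
  // `name` is assumed to be alphanumeric, so it is not regex-escaped.
  const token = new RegExp("<(/?)" + name + "\\b[^>]*>", "gi");
  const spans = [];
  let depth = 0;
  let start = -1;
  let m;
  while ((m = token.exec(input)) !== null) {
    if (m[1] === "") {
      // Opening tag: remember where the outermost element begins.
      if (depth === 0) start = m.index;
      depth++;
    } else if (depth > 0) {
      // Closing tag: the outermost element ends when depth returns to zero.
      depth--;
      if (depth === 0) spans.push(input.slice(start, token.lastIndex));
    }
  }
  return spans;
}

const spans = findBalanced(
  "irrelevant <tag>gimme cookies<tag>gimme more cookies</tag></tag> even more irrelevant",
  "tag"
);
console.log(spans);
```

Stray closing tags are ignored and unclosed opening tags produce no span, which is a design choice of this sketch rather than anything the answer prescribes.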

Tim Pietzcker