
Disclosure: I have read this answer many times here on SO and I know better than to use regex to parse HTML. This question is just to broaden my knowledge with regex.

Say I have this string:

some text <tag link="fo>o"> other text

I want to match the whole tag but if I use <[^>]+> it only matches <tag link="fo>.

How can I make sure that a > inside quotes is ignored?

I can trivially write a parser with a while loop to do this, but I want to know how to do it with regex.
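The behaviour described in the question can be reproduced with a few lines of JavaScript. This is only a sketch; the variable names are made up for illustration.

```javascript
// The example string from the question.
const questionInput = 'some text <tag link="fo>o"> other text';

// The naive pattern: "<", one or more characters that are not ">", then ">".
const naiveTag = /<[^>]+>/;

// The match stops at the first ">", even though it sits inside the quotes.
const naiveMatch = questionInput.match(naiveTag);
console.log(naiveMatch[0]); // <tag link="fo>
```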

kaya3
steve

4 Answers


Regular Expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Full Explanation:

I know this regex might be a headache to look at, so here is my explanation:

<                      # Open HTML tags
    [^>]*?             # Lazy Negated character class for closing HTML tag
    (?:                # Open Outside Non-Capture group
        (?:            # Open Inside Non-Capture group
            ('|")      # Capture group for quotes, backreference group 1
            [^'"]*?    # Lazy Negated character class for quotes
            \1         # Backreference 1
        )              # Close Inside Non-Capture group
        [^>]*?         # Lazy Negated character class for closing HTML tag
    )*                 # Close Outside Non-Capture group
>                      # Close HTML tags
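As a quick check, the regex above can be run against the string from the question. A minimal JavaScript sketch follows; the variable names are illustrative and not part of the answer.

```javascript
// The regex from this answer, unchanged.
const backrefTag = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/;

// The example string from the question.
const backrefInput = 'some text <tag link="fo>o"> other text';

const backrefMatch = backrefTag.exec(backrefInput);
console.log(backrefMatch[0]);    // <tag link="fo>o">
console.log(backrefMatch.index); // 10
```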
David Passmore
Vasili Syrakis
  • Are you sure `[^\1]` does what you think it does? I don't think `\1` can be used inside a character class like that. – zrajm Mar 04 '14 at 06:39
  • You are correct! How silly of me. I changed it so that it is a negated character class for quotes, since that is all we will ever match with group 1 anyway. Although this introduces problems when we begin to have single quotes inside double quotes... I do have a solution for that, but it's very long. – Vasili Syrakis Mar 04 '14 at 06:43
  • Amazingly gorgeous and beautiful piece of regex! Thanks! – steve Mar 04 '14 at 07:05
  • Why not simply have two regexes? One for `"` delimited args, and one for `'`? Then try the second regex only if the first did not match? – zrajm Mar 04 '14 at 07:44
  • Or, for that matter, you could use `(?:'[^']*'|"[^"]*")` (instead of `(?:('|")[^'"]*?\1)`). (You really don't need the `*?` quantifier inside the quotes. The match will always be the same here regardless of whether you use `*` or `*?`.) – zrajm Mar 04 '14 at 07:45

This is a slight improvement on Vasili Syrakis's answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any run of characters other than ', " or >
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any run of characters other than ', " or >
    )*               #   zero or more of outer group
>                    # end of HTML tag

This version is slightly better than Vasili's in that single quotes are allowed inside "…" and double quotes are allowed inside '…', and in that an (incorrect) tag like <a href='> will not be matched.

It is slightly worse than Vasili's solution in that the groups are captured. If you do not want that, replace ( with (?: in all places. (Just using ( makes the regex shorter and a little more readable.)
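The points above can be checked with a short JavaScript sketch. It uses the non-capturing form of the regex, with every ( replaced by (?:. Apart from the first, the sample inputs are invented for illustration.

```javascript
// Non-capturing variant of the regex above: every "(" replaced with "(?:".
const quoteAwareTag = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/;

// Illustrative inputs; only the first comes from the question.
const samples = [
    'some text <tag link="fo>o"> other text', // ">" inside double quotes
    `<a title='say "hi" > now'>`,             // double quotes inside '…'
    `<a title="it's > here">`,                // single quote inside "…"
    "<a href='>",                             // unterminated quote: no match
];

for (const sample of samples) {
    const found = quoteAwareTag.exec(sample);
    console.log(JSON.stringify(sample), "=>", found ? found[0] : null);
}
```

The first three inputs should each yield the complete tag, and the last should yield null.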

zrajm

(<.+?>[^<]+>)|(<.+?>)

You can write two regexes and then combine them with '|'. In this case:

(<.+?>[^<]+>)   # matches the tag in: some text <tag link="fo>o"> other text
(<.+?>)         # matches the tag in: some text <tag link="foo"> other text

If the first alternative matches, the second one is not tried, so make sure you put the special case first.
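A small JavaScript sketch of how the alternation behaves on the two strings above. The third input is an invented example, added to show that the first alternative can also match more than a tag when a bare > appears in the text that follows it.

```javascript
// The combined regex from this answer, unchanged.
const eitherTag = /(<.+?>[^<]+>)|(<.+?>)/;

// First alternative succeeds: the ">" inside the quotes is skipped over.
const withGt = eitherTag.exec('some text <tag link="fo>o"> other text');
console.log(withGt[0]); // <tag link="fo>o">

// First alternative fails (no second ">"), so the second one is used.
const withoutGt = eitherTag.exec('some text <tag link="foo"> other text');
console.log(withoutGt[0]); // <tag link="foo">

// Hypothetical input: a ">" in the text after the tag is swallowed too.
const overMatch = eitherTag.exec("<b>1 > 0</b>");
console.log(overMatch[0]); // <b>1 >
```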

宏杰李

If you want this to work with escaped double quotes, try:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

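// Matches a ">" that is followed by an even number of unescaped double quotes up to the end of the input.
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
// `xml` is the string being scanned; returns the index of the next match, or -1.
const nextGtMatch = () => {
    const match = gtExp.exec(xml);
    return match ? match.index : -1;
};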

And if you're parsing through a bunch of XML, you'll want to set .lastIndex.

gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
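Put together, the idea might look like the self-contained sketch below. The helper name `findTagEnd` and the sample strings are assumptions made for illustration; only the regex comes from this answer.

```javascript
// The regex from this answer: a ">" followed by an even number of
// unescaped double quotes up to the end of the input.
const gtOutsideQuotes = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

// Hypothetical helper: index of the first ">" outside quotes at or after
// fromIndex, or -1 if there is none.
function findTagEnd(xml, fromIndex) {
    // With the "g" flag, exec() starts searching at lastIndex.
    gtOutsideQuotes.lastIndex = fromIndex;
    const result = gtOutsideQuotes.exec(xml);
    return result ? result.index : -1;
}

const sampleXml = 'some text <tag link="fo>o"> other text';
const tagStart = sampleXml.indexOf("<");
const tagEnd = findTagEnd(sampleXml, tagStart);
console.log(sampleXml.slice(tagStart, tagEnd + 1)); // <tag link="fo>o">

// An escaped quote (\") inside the attribute value does not end the value.
const escapedXml = '<tag link="a>b\\"c">';
console.log(findTagEnd(escapedXml, 0)); // 18
```

Because the lookahead scans to the end of the input for every candidate >, this approach may be slow on large documents.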
qel