1

I just need someone to correct my understanding of this regex, which is a kind of stopgap for matching HTML tags.

< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >

My understanding -

  • < - Match the tag-open symbol
  • (?: - Can't understand what's going on here. What do these symbols mean?
  • "[^"]*" ['"]* - An arbitrary string in double quotes. Is something else going on here?
  • '[^']*'['"]* - Some string in single quotes
  • [^'">] - Any character other than ' " >.

So it's a '<' symbol, followed by a string in double quotes or in single quotes or any other string which doesn't contain ' " or >, repeated one or more times, followed by a '>'.
That's the best I could make out.

Samhan Salahuddin
  • I think your understanding looks sound. But as with all things regex, you should get yourself a 'regular expressions tester' and check a few scenarios to be sure (I use a Firefox plugin that does the job). – Stewart Ritchie Oct 04 '12 at 07:39

3 Answers

5
<       # literally just an opening angle bracket (followed by a space)
(       # the bracket opens a subpattern, it's necessary as a boundary for
        # the | later on
?:      # makes the just opened subpattern non-capturing (so you can't access it
        # as a separate match later)
"       # literally "
[^"]    # any character but " (this is called a character class)
*       # arbitrarily many of those (as much as possible)
"       # literally "
['"]    # either ' or "
*       # arbitrarily many of those (and possibly alternating! it doesn't have
        # to be the same character for the whole string)
|       # OR
'       # literally '
[^']    # any character but ' (this is called a character class)
*       # arbitrarily many of those (as much as possible)
'       # literally '
['"]*   # as above
|       # OR
[^'">]  # any character but ', ", >
)       # closes the subpattern
+       # arbitrarily many repetitions but at least once
>       # closing angle bracket

Note that all the spaces in the regex are treated just like any other character. They match exactly one space.
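The point about spaces is easy to check by running the pattern with and without them. The following is only a minimal Python sketch: the names `spaced` and `compact` and the sample tags are made up for illustration, and it assumes the pattern is not compiled in an extended/verbose mode where whitespace is ignored.

```python
import re

# Pattern exactly as posted in the question: every space is a literal space.
spaced = re.compile(r'''< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >''')

# The same pattern with all spaces removed (comparison variant, an assumption).
compact = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

# An ordinary tag has no space after "<", so the spaced pattern rejects it.
print(bool(spaced.fullmatch('<br>')))            # False
print(bool(compact.fullmatch('<br>')))           # True

# With a space after "<" and before ">", the spaced pattern matches.
print(bool(spaced.fullmatch('< br >')))          # True

# A quoted attribute value only matches the spaced pattern when the literal
# spaces around the quotes are present too.
print(bool(spaced.fullmatch('< a href="x" >')))  # False
print(bool(compact.fullmatch('<a href="x">')))   # True
```

In the spaced version, the double-quoted alternative expects a space before the opening quote and two spaces after the closing quote, which is why `href="x"` is rejected even inside `< ... >`.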

Also take special note of the ^ at the beginning of character classes. It's not treated as a separate character, but inverts the whole character class.

I may also (obligatorily) add that regular expressions are not appropriate for parsing HTML.
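For comparison, here is a rough sketch of what using a real parser could look like, using Python's standard-library `html.parser` module. The class name `TagCollector` and the sample markup are hypothetical and chosen only for illustration; this is one possible approach, not a recommendation from the original post.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
# The attribute value contains ">" inside quotes, the kind of input the
# regex in the question is trying to cope with.
collector.feed('<a href="x>y" title="t">link</a><br/>')
print(collector.tags)
```

The parser takes care of quoting and self-closing tags itself, so no hand-written pattern has to anticipate those cases.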

Martin Ender
  • Thanks for the great answer. Non-capturing subpattern... googling it now – Samhan Salahuddin Oct 04 '12 at 08:17
  • That's probably a good idea. It's quite a powerful concept when you want to extract data from within larger structures or you need to replace those structures but keep the data within (using regex). – Martin Ender Oct 04 '12 at 08:18
  • There's just one more thing I can't understand... The pattern `"[^"]*"` `['"]*` should match "some random stuff here", but why is there `['"]*` at the end? Does the * apply to the whole expression or to just the character set `['"]`? – Samhan Salahuddin Oct 04 '12 at 08:20
  • It only applies to the character class `['"]`. I'm not really sure what its purpose is, because these characters are already taken care of by the third part of the alternation (the part after the second `|`). Also note that this regex does NOT match self-closing tags, because they have no space in front of the closing `>`. – Martin Ender Oct 04 '12 at 08:23
  • I found this regex in one of the answers to the post you linked in your answer above. Since I'm just beginning with regex I needed clarification on it. Since we're both stumped, I guess it must be there to handle some obscure corner case we can't think of. – Samhan Salahuddin Oct 04 '12 at 08:33
  • Oh, the one with 417 upvotes? Be aware that by adding the spaces, yours is generally different (a bit). The original one **did** match self-closing tags. – Martin Ender Oct 04 '12 at 08:36
2

Split it up by the |s, which denote ors:

<
  (?:
    "[^"]*" ['"]* |
    '[^']*'['"]* |
    [^'">]
  )+
>

(?: denotes a non-capturing group. The insides of that group match these things (in this order):

  1. "stuff"
  2. 'stuff'
  3. asd=

In effect, this is a regex that attempts to match HTML tags with attributes.
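To see what "non-capturing" changes in practice, the group can be compared with an ordinary capturing one. This is a small Python sketch under a stated assumption: the spaces are removed from the pattern so that ordinary tags match, and the names `non_capturing`, `capturing` and the sample `text` are made up for illustration.

```python
import re

# Unspaced variant of the pattern (an assumption; the posted one has spaces).
non_capturing = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
# Same pattern, but with a plain capturing group instead of (?: ... ).
capturing = re.compile(r'''<("[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')

text = '''<a href="x>y" title='t'>link</a>'''

# Number of capturing groups in each pattern.
print(non_capturing.groups)  # 0
print(capturing.groups)      # 1

# The overall match is the same; the quoted ">" does not end the tag.
print(non_capturing.search(text).group(0))  # <a href="x>y" title='t'>

# A repeated capturing group keeps only its last iteration.
print(capturing.search(text).group(1))      # 't'
```

Both patterns match the same text; the only difference is whether the group's last iteration is stored for later access.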

Blender
0

Here is the result of YAPE::Regex::Explain

(?-imsx:< (?: "[^"]*" ['"]* | '[^']*'['"]*|[^'">])+ >)

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <                        '< '
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
     "                       ' "'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '" '
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
                             ' '
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
     '                       ' \''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
    ['"]*                    any character of: ''', '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^'">]                   any character except: ''', '"', '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
   >                       ' >'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Toto