7

This regular expression should match an HTML start tag, I think.

var results = html.match(/<(\/?)(\w+)([^>]*?)>/);

I see it should first match the <, but then I am confused about what the capture (\/?) accomplishes. Am I correct in reasoning that ([^>]*?)> matches any character except >, zero or more times, up to the closing >? If so, why is the (\w+) capture necessary? Doesn't it fall within the purview of [^>]*?

1252748
  • it finds end tags, you know instead of ... the \w captures the tag name to a parameter to use in replacement instead of bundling it with the attrib section... for a match you don't need it, but it helps if the regexp is recycled into a replace()... – dandavis Jul 03 '13 at 16:43

5 Answers

4

Take it token by token:

  • / begin regex literal
  • < match a literal <
  • (\/?) match 0 or 1 (?) literal /, which is escaped by the \
  • (\w+) match one or more "word characters"
  • ([^>]*?) lazily* match zero or more (*?) of anything that is not a >
  • > match a literal >
  • / end regex literal

lazily* - adding "?" after a repetition quantifier will make it perform lazily, meaning the regex will match the preceding token the minimum number of times. See the documentation.

So essentially this regular expression will match "<", optionally followed by a "/", followed by one or more letters, digits, or underscores, followed by zero or more characters that are not ">", and finally followed by a ">".

That being said, the token (\w+) is not redundant, as it ensures there is at least one word character in between < and >.
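As a rough sketch of both points (the lazy quantifier and the role of (\w+)), the snippet below compares the original pattern with a greedy variant. The sample string and the variable names are made up for illustration; only the first regex comes from the question.

```javascript
// The regex from the question, plus a greedy variant of its last group.
const lazyTag = /<(\/?)(\w+)([^>]*?)>/;
const greedyTag = /<(\/?)(\w+)([^>]*)>/;

// Hypothetical sample input with several tags on one line.
const html = '<p class="a">one</p><p class="b">two</p>';

// [^>] can never consume a ">", so both variants stop at the first ">".
console.log(html.match(lazyTag)[0]);   // <p class="a">
console.log(html.match(greedyTag)[0]); // <p class="a">

// With "." instead of "[^>]", laziness does make a difference.
const lazyDot = /<(\/?)(\w+)(.*?)>/;
const greedyDot = /<(\/?)(\w+)(.*)>/;
console.log(html.match(lazyDot)[0]);            // <p class="a">
console.log(html.match(greedyDot)[0] === html); // true (runs to the last ">")

// (\w+) requires at least one word character right after "<" or "</".
console.log(lazyTag.test('<>'));     // false
console.log(lazyTag.test('< div>')); // false
```

So for this particular pattern the `?` after `*` changes nothing about what is matched, while (\w+) is what rejects things like `<>`.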

Please be aware that attempting to parse HTML with regular expressions is generally a bad idea.

jbabey
  • The "?" is not redundant, in case there is more than one html tag on the same line! – Tom Lord Jul 03 '13 at 16:47
  • @TomLord I have edited the answer to include what the `*?` actually does. Learned something new myself :) – jbabey Jul 03 '13 at 16:53
  • @TomLord why is it useful to have this match be "lazy"? – 1252748 Jul 03 '13 at 16:56
  • @TomLord, I beg to differ. Even with `[^>]*`, it won't match multiple tags because it has to go through a closing angle bracket (`>`) to do so. So yes, the `?` is actually redundant here. – doubleDown Jul 03 '13 at 17:03
  • Actually, I take that back - it *is* actually redundant. But let me explain... Suppose the code had said (.*?) instead of ([^>]*?). (This is what I thought it basically was doing, at a glance!) Compare what happens with/without the "?": http://www.rubular.com/r/U5VtGkFY3q http://www.rubular.com/r/Fsig98EoDg – Tom Lord Jul 03 '13 at 17:04
  • In that second (bad) example, the .* operator is being "greedy" (as opposed to "lazy"), and is matching **as much as it can**, up to the final ">". – Tom Lord Jul 03 '13 at 17:06
  • In the actual code, however, this is not a problem because even a "greedy" ([^>]*) will only match up to the first ">". – Tom Lord Jul 03 '13 at 17:07
4

Using the power of debuggex to generate you an image :)

<(\/?)(\w+)([^>]*?)>

Will be evaluated like this:

[Image: diagram of the regular expression, generated with Debuggex]

As you can see, it matches HTML tags (both opening and closing tags). The regex contains three capture groups, capturing the following:

  1. (\/?) existence of / (it's a closing tag, if present)
  2. (\w+) name of the tag
  3. ([^>]*?) everything else until the tag closes (e.g. attributes)
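To see the three groups side by side, here is a small sketch. The helper name `describeTag` and the sample strings are hypothetical and only there for illustration.

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: describe the first tag found in a string, or null.
function describeTag(text) {
  const m = text.match(tagPattern);
  if (m === null) return null;
  return {
    isClosing: m[1] === '/', // group 1: the optional slash
    name: m[2],              // group 2: the tag name
    rest: m[3],              // group 3: everything up to the closing ">"
  };
}

console.log(describeTag('<a href="#">'));
console.log(describeTag('</strong>'));
console.log(describeTag('no tags here')); // null
```

For `<a href="#">` the `rest` field comes back as ` href="#"` (with the leading space), and for `</strong>` it is the empty string.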

This way it matches <a href="#">. Interestingly, it does not match <a data-fun="fun>nofun"> correctly, because it stops at the > inside the data-fun attribute, even though (I think) > is valid in an attribute value.
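A quick sketch of that failure case follows. The second pattern, `quoteAware`, is not from the question: it is only an assumption about how one might skip over quoted attribute values, and it is still nowhere near a real HTML parser.

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;
const input = '<a data-fun="fun>nofun">';

// The original pattern stops at the ">" inside the attribute value.
const m = input.match(tagPattern);
console.log(m[0]); // <a data-fun="fun>
console.log(m[3]); //  data-fun="fun

// Hypothetical variant: treat quoted strings as single units, so a ">"
// inside quotes does not end the tag.
const quoteAware = /<(\/?)(\w+)((?:"[^"]*"|'[^']*'|[^>"'])*)>/;
console.log(input.match(quoteAware)[0] === input); // true
```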

Another funny thing is that the tag-name capture does not capture all theoretically valid XHTML tag names. XHTML allows Letter | Digit | '.' | '-' | '_' | ':' | .. (source: XHTML spec). (\w+), however, does not match ., -, and :. An imaginary <.foobar> tag will not be matched by this regex. This should not have any real-life impact, though.
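To illustrate, here is a sketch with made-up tag names (`my-widget` and `svg:rect` are just examples). Note that a name containing - or : still matches as a whole tag, but the name group is cut short.

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// No word character directly after "<", so there is no match at all.
console.log('<.foobar>'.match(tagPattern)); // null

// A "-" or ":" ends group 2 early; the rest of the name lands in group 3.
const hyphen = '<my-widget id="x">'.match(tagPattern);
console.log(hyphen[2]); // my
console.log(hyphen[3]); // -widget id="x"

const colon = '<svg:rect/>'.match(tagPattern);
console.log(colon[2]); // svg
console.log(colon[3]); // :rect/
```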

You see that parsing HTML using regexes is a risky thing. You might be better off with an HTML parser.

tessi
3

(\/?) matches and captures the / of any closing tag, such as </i> or </strong>, if you're familiar with them.

Another thing to note is that \w is really the character class [a-zA-Z_\d], so other characters like =, ", etc. are not matched by it, but they will be matched by [^>]. And yes, you are correct about that bit.
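If you want to convince yourself, something like this should do it. It is a small sketch that checks every ASCII character, assuming plain JavaScript regexes without extra flags.

```javascript
// Compare \w against the explicit class for every ASCII code point.
const shorthand = /^\w$/;
const explicit = /^[a-zA-Z_\d]$/;

let mismatches = 0;
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (shorthand.test(ch) !== explicit.test(ch)) mismatches++;
}
console.log(mismatches); // 0

// "=" and '"' are not word characters, but [^>] accepts them.
console.log(/\w/.test('='));   // false
console.log(/[^>]/.test('=')); // true
console.log(/[^>]/.test('"')); // true
```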

Jerry
  • @p.s.w.g Yes, it won't match the regex. It was just as an example for the forward slash, but I guess I'll just use another =/ – Jerry Jul 03 '13 at 16:46
  • Thanks. I don't understand what bit I am correct about though – 1252748 Jul 03 '13 at 16:54
  • @thomas This bit: "Am I correct in reasoning that the `([^>]*?)>` searches for every character except `>` >= 0 times?" :) – Jerry Jul 03 '13 at 16:56
2

To answer your last question, (\w+) and ([^>]*?) are not redundant. They both serve important functions in the expression.

This expression finds start or end tags.

(\/?) matches a /, but the ? makes it optional.

(\w+) matches word characters, intended to match the tag name here.

([^>]*?) is intended to match attributes.

So if you had the string <div class="text">,

The (\w+) in the expression would match div and the ([^>]*?) would match class="text"
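Keeping the tag name and the attributes in separate groups pays off once the pattern is reused in a replace(). The sketch below is only an illustration: the `g` flag and the sample string are my additions, not part of the original code.

```javascript
// Same pattern as in the question, with "g" added so replace() visits every tag.
const tagPattern = /<(\/?)(\w+)([^>]*?)>/g;

// Hypothetical sample input.
const html = '<div class="text"><b id="x">hi</b></div>';

// Rebuild each tag from groups 1 and 2 only, dropping group 3 (the attributes).
const stripped = html.replace(tagPattern, '<$1$2>');
console.log(stripped); // <div><b>hi</b></div>

// For <div class="text">, group 3 includes the leading space.
const first = html.match(/<(\/?)(\w+)([^>]*?)>/);
console.log(JSON.stringify(first[2])); // "div"
console.log(JSON.stringify(first[3])); // " class=\"text\""
```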

Jason P
  • It's picky, but the `([^>]*?)` actually matches ` class="text"` (including the space after 'div'). :) – tessi Jul 03 '13 at 17:50
  • @tessi You're right, and if you edit my post, you can see I have the space in there, but it gets removed when it is displayed. If someone can suggest a way to get the space to display, I'd appreciate it. – Jason P Jul 03 '13 at 17:52
  • Har, that's funny. I will never blame you again because of that (specific) space then ;) – tessi Jul 03 '13 at 17:55
  • Agreed. `([^>]*?)` *by itself* is not redundant. However in `(\w+)([^>]*?)>` the lazy operation *is* redundant. The regex has to match one or more "word" characters (which aren't ">"), followed by zero or more non right pointing angle bracket characters, followed by a right pointing angle bracket character. Greedy or not, it has to match that character (class) sequence exactly. Lazy only applies when the types of characters you are matching may also match what follows, like in `.*?<`; `.` can match "<". – JayC Jul 03 '13 at 20:15
  • @JayC That may be. I think that sentence in my answer wasn't clear, hopefully the edit is more clear. – Jason P Jul 03 '13 at 20:57
0

Demo (in Ruby, not JavaScript, but it makes no difference): http://www.rubular.com/r/bhw2O28qUr

To summarise, it's to capture end tags.

Tom Lord