
I found this useful regex code here while looking to parse HTML tag attributes:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

It works great, but it's missing one key element that I need. Some attributes are event handlers that contain inline JavaScript code, like this:

onclick="doSomething(this, 'foo', 'bar');return false;"

Or:

onclick='doSomething(this, "foo", "bar");return false;'

I can't figure out how to get the original expression to not count the quotes from the JS (single or double) while it's nested inside the set of quotes that contain the attribute's value.

I should add that this is not being used to parse an entire HTML document. It's applied to an argument of an older "array to select menu" function that I've updated. One of the arguments is a string of extra HTML attributes to append to the form element.

I've made an improved function and am deprecating the old... but in case somewhere in the code is a call to the old function, I need it to parse these into the new array format. Example:

// Old Function
function create_form_element($array, $type, $selected="", $append_att="") { ... }
// Old Call
create_form_element($array, SELECT, $selected_value, "onchange=\"something(this, '444');\"");

The new version takes an array of attr => value pairs to create the extra attributes.

create_select($array, $selected_value, array('style' => 'width:250px;', 'onchange' => "doSomething('foo', 'bar')"));

This is merely a backwards-compatibility issue: all calls to the old function are routed to the new one, but the $append_att argument of the old function needs to be converted into an array for the new one, hence my need to use regex to parse small HTML snippets. If there is a better, lightweight way to accomplish this, I'm open to suggestions.

Deduplicator
mwieczorek

3 Answers


The problem with your regular expression is that it tries to handle both single and double quotes at the same time. It doesn't support attribute values that contain the other quote. This regex will work better:

(\w+)=("[^<>"]*"|'[^<>']*'|\w+)
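A minimal sketch of how this pattern might be applied to the asker's backwards-compatibility case, assuming the old `$append_att` string only ever contains simple `name=value` pairs. The helper name `parse_attr_string` is hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper (name assumed for illustration): converts an old-style
// attribute string into the name => value array the new function expects.
function parse_attr_string(string $attrs): array
{
    // The pattern from this answer: a name, "=", then a double-quoted,
    // single-quoted, or unquoted value.
    $pattern = '/(\w+)=("[^<>"]*"|\'[^<>\']*\'|\w+)/';

    $result = [];
    if (preg_match_all($pattern, $attrs, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $value = $m[2];
            // Strip the surrounding quotes when the value is quoted.
            if ($value[0] === '"' || $value[0] === "'") {
                $value = substr($value, 1, -1);
            }
            $result[$m[1]] = $value;
        }
    }
    return $result;
}

// The old-style call from the question, plus a single-quoted attribute.
$old = "onchange=\"something(this, '444');\" style='width:250px;'";
print_r(parse_attr_string($old));
```

Because each alternative only excludes its own quote character, the single quotes inside the double-quoted `onchange` value are kept as part of the value.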
Jan Goyvaerts
  • Close, but HTML 4.01 attribute values _can_ contain angle brackets. Also, an attribute name and an unquoted attribute value may contain dashes, dots and colons. A better expression is thus: `([\w\-.:]+)\s*=\s*("[^"]*"|'[^']*'|[\w\-.:]+)` (Pedantic, I know...) – ridgerunner Oct 12 '11 at 16:19
  • Well, this doesn't work for Text where there's a `a="b+c"` without a tag around it, maybe try: `(<\w.)([\w\-.:]+)\s*=\s*("[^"]*"|'[^']*'|[\w\-.:]+)` – sneaky Mar 12 '19 at 15:29

The following regex patterns follow the HTML syntax spec available here:

http://www.w3.org/TR/html-markup/syntax.html

regex patterns

// valid tag names
$tagname = '[0-9a-zA-Z]+';
// valid attribute names
$attr = "[^\s\\x00\"'>/=\pC]+";
// valid unquoted attribute values
$uqval = "[^\s\"'=><`]*";
// valid single-quoted attribute values
$sqval = "[^'\\x00\pC]*";
// valid double-quoted attribute values
$dqval = "[^\"\\x00\pC]*";
// valid attribute-value pairs
$attrval = "(?:\s+$attr\s*=\s*\"$dqval\")|(?:\s+$attr\s*=\s*'$sqval')|(?:\s+$attr\s*=\s*$uqval)|(?:\s+$attr)"; 

The final combined pattern is then:

    // start tags + all attr formats
    $patt[] = "<(?'starttags'$tagname)(?'tagattrs'($attrval)*)\s*(?'voidtags'[/]?)>";

    // end tags
    $patt[] = "</(?'endtags'$tagname)\s*>"; // end tag

    // full regex pcre pattern
    $patt = implode("|", $patt);
    // search and match
    preg_match_all("#$patt#imuUs",$data,$matches);

hope this helps.

Tech Consultant

Even better would be to use a backreference. In PHP, the regular expression would be:

([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["'])(.*?)\\2

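Here \\2 (the PHP-string-escaped form of \2) is a backreference to the second group, (["']), so the closing quote must be the same character as the opening quote.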

This regular expression also matches attribute names containing _, - and :, which are allowed according to the W3C. However, it won't match attributes whose values are not enclosed in quotes.
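As a rough sketch of the backreference approach in use, the snippet below collects quoted attributes into an array. The helper name `parse_quoted_attrs` is hypothetical. Two details are adjustments of mine rather than part of the expression above: the name quantifier is `*` instead of `+` so that one-character names also match, and the `s` flag is added so values may span lines.

```php
<?php
// Hypothetical helper (name assumed for illustration).
function parse_quoted_attrs(string $attrs): array
{
    // Group 1: attribute name, group 2: opening quote, group 3: value.
    // \2 forces the closing quote to match the opening quote, so the other
    // quote character can appear freely inside the value.
    // Assumption: "*" replaces the original "+" after the first name character.
    $pattern = '/([a-zA-Z_:][-a-zA-Z0-9_:.]*)=(["\'])(.*?)\2/s';

    $result = [];
    preg_match_all($pattern, $attrs, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $result[$m[1]] = $m[3];
    }
    return $result;
}

$snippet = "onclick=\"doSomething(this, 'foo', 'bar');return false;\" data-id='7'";
print_r(parse_quoted_attrs($snippet));
```

Unquoted values such as `size=5` are skipped by this pattern, so input that may contain them would need a separate alternative for the unquoted case.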

Koen.
  • What about: `"attrib_name = unquoted_value"`? – ridgerunner Oct 12 '11 at 16:21
  • Spaces in attribute/value definitions are, as far as I know, not allowed. Or isn't that what you are asking? – Koen. Oct 16 '11 at 12:03
  • Sorry for not being clear. Your solution matches attributes with values that are quoted (either `name="dq val"` or `name='sq val'`), but fails to match attributes with _unquoted_ values (`name=uq_val`). – ridgerunner Oct 19 '11 at 03:03
  • Yes, I'm aware of that, and I also mentioned it in my answer ;) Nevertheless, according to the W3 XHTML spec, attribute values must always be quoted (http://www.w3.org/TR/xhtml1/#h-4.4). And now with HTML5 and data attributes filled with objects, the quoting is almost necessary. – Koen. Oct 27 '11 at 22:35
  • Koen, extra whitespace between an attribute, the `=` sign, and the attribute value is allowed. Also, I don’t see how XHTML is relevant anno 2013. In HTML quotes around attribute values are optional (in general), but depending on the attribute value you want to use you’re still gonna need them every now and then. See [Unquoted attribute values in HTML, CSS and JavaScript](http://mathiasbynens.be/notes/unquoted-attribute-values) for more information. – Mathias Bynens Mar 08 '13 at 10:53