Seeking regex for HTML attributes meeting specific criteria

Question

I'm trying to remove single quotes and double quotes around HTML attributes with the following restrictions:

1) The quoted material MUST exist within a tag <> (e.g., <mytag b="yes"> becomes <mytag b=yes>, but <script>var b="yes"</script> stays intact).

2) The quoted material may not have a space character nor an equal sign (e.g., <mytag b="no no" c="no=no"> stays intact).

3) The quoted material may not be in an href or src definition.

4) The regex should be good for UTF-8 (duh!)

Someone posted a virtually identical question here that received an answer that works within the confines of the question:

Removing single and double quote from html attributes with no white spaces on all attributes except href and src

So:

((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')

...works, except it fails to isolate text within tags (i.e., text in between opening and closing tags is erroneously edited, e.g. <mytag>"The quotes are stripped out here!"</mytag>), and it doesn't check for equal signs (=) within the quoted text (e.g. <mytag b="OhNo=TheRoutineRemovedTheQuotesBecauseItDidNotCheckForAnEqualSignInTheQuotedText!">).

Bonus points: I wish to integrate this into this php HTML minification routine, which works well except for the edits described above:

https://gist.github.com/tovic/d7b310dea3b33e4732c0

His solution pairs the patterns and replacement params in two arrays, as you'll see, so I need to conform to his syntax, which uses #, etc.

Your solution gets my upvote!

This seems like a bad idea. You should try using an HTML parser instead. — Laurel, May 02 '16 at 04:54
You'd be better off using a [`DocumentFragment`](https://developer.mozilla.org/en/docs/Web/API/DocumentFragment) — MDEV, May 02 '16 at 14:48

Wiktor Stribiżew · Accepted Answer · 2016-05-02T14:44:37.583

1

Here is a pure regex way of getting rid of the quotes:

~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|'[^']*'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|'([^\s'=]*)')~u

See the regex demo, replace with '$1'.

IDEONE demo:

$re = '~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';
$str = "<mytag src=\"src_here\" b=\"yes\" href=\"href_here\"> becomes <mytag src=\"src_here\" b=yes href=\"href_here\">\n<mytag b='yes'> becomes <mytag b=yes>\nbut <script>var b=\"yes\"</script> stays intact\n<mytag b=\"no no\" c=\"no=no\"> stays intact\n<tag href=\"something\"> text <tag src=\"dddd\"> intact"; 
$subst = '$1'; // single quotes, so PHP never attempts to interpolate $1
$result = preg_replace($re, $subst, $str);
echo $result;

Pattern details:

(?:<\w+|(?!^)\G) - match the tag (<\w+) or (|) the end of the last successful match ((?!^)\G)
(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))* - matches the unwelcome href and src attributes to later omit them with \K
\s+ - match 1+ whitespace(s)
(?!(?:href|src)=)\w+= - 1+ alphanumeric or underscore characters (\w+) followed with = that are not href= or src= (see (?!(?:href|src)=) negative lookahead)
\K - omit the whole text matched so far
(?|"([^\s"=]*)"|\'([^\s\'=]*)\') - a branch reset group capturing into Group 1 either:
- "([^\s"=]*)" - double quoted attribute with no =, " and whitespace
- | - or
- \'([^\s\'=]*)\' - single quoted attribute with no =, ' and whitespace
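To tie this back to the question's "bonus points", here is a minimal sketch of how the pattern might sit in a pair of pattern/replacement arrays using # delimiters. The array names and the strip_attribute_quotes() helper are hypothetical; the linked gist's actual variable names and other entries are not reproduced here.

```php
<?php
// Hypothetical stand-ins for the minifier's paired arrays.
$patterns = [
    // Same pattern as above, with '#' delimiters instead of '~'.
    '#(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')#u',
];
$replacements = [
    '$1', // keep only the value captured in group 1, without its quotes
];

// Hypothetical helper: applies each pattern with the replacement at the same index.
function strip_attribute_quotes(string $html, array $patterns, array $replacements): string
{
    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace($patterns, $replacements, $html) ?? $html;
}

$input  = '<mytag src="s" b="yes" c=\'ok\'><script>var b="yes"</script>';
$result = strip_attribute_quotes($input, $patterns, $replacements);
echo $result, "\n"; // <mytag src="s" b=yes c=ok><script>var b="yes"</script>
```

Since the pattern itself contains no # character, switching delimiters should not require any extra escaping.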

edited May 02 '16 at 14:44

answered May 02 '16 at 13:30

Wiktor Stribiżew

Thanks! This works on my initial testing of it. And I appreciate your clear explanation of the components of the regex! – Tom May 02 '16 at 14:19
I replied too soon...it seems that once "src" or "href" is encountered in a tag, all subsequent elements in that tag are ignored. Try putting ` ` into your regex demo and you'll see. – Tom May 02 '16 at 14:36
You may match them before omitting with `\K`: [`(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|'[^']*'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|'([^\s'=]*)')`](https://regex101.com/r/nX9bS5/2). I updated the regex, demos and explanation. – Wiktor Stribiżew May 02 '16 at 14:41
Thanks! But I discovered a wrinkle: If the element has a hyphen in it, then it fails to recognize it as a tag element, and it and the subsequent attribute definitions are ignored; curiously, it's fine with underscores (_). Try: ` ` – Tom May 02 '16 at 14:53
What convention do you need the regex to follow? If attribute names can include dots, hyphens or colons, let's include them and do the same for tag names: [`'~(?:<[a-z][\w:.-]*|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~ui'`](https://regex101.com/r/nX9bS5/3) – Wiktor Stribiżew May 02 '16 at 14:59
I'm using a third-party library that has custom elements that include hyphens in their names, sadly. But your revision almost gets it...but see this example: `` For some reason in this example, it ceases editing out the quotes after the `src` element is encountered, perhaps because the elements with hyphenated names are encountered previously within the tag? – Tom May 02 '16 at 16:20
I believe that once it's found an element with a space or similar that requires it to retain the surrounding quotes, it ceases editing the attributes that follow for the rest of the tag. – Tom May 02 '16 at 16:29
Aha! It simply needs a `*` after the initial pattern, like this: `~(?:<[a-z][\w:.-]*|(?!^)\G)*`. Thus the complete solution is `~(?:<\w+|(?!^)\G)*(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u` I owe you a beer :-) – Tom May 02 '16 at 17:10
I think what you did is incorrect. I will check once my kids go to bed. – Wiktor Stribiżew May 02 '16 at 17:45
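A minimal sketch of one way the added `*` can go wrong, assuming the pattern exactly as quoted in the comment above: once the leading group is optional, a match no longer has to begin at `<tag` or at the end of the previous match, so text outside a tag can be edited. The sample input reuses the script example from the question.

```php
<?php
// Pattern from the comment above, with '*' after the leading group.
$loose = '~(?:<\w+|(?!^)\G)*(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';
// The answer's pattern, where the leading group is mandatory.
$strict = '~(?:<\w+|(?!^)\G)(?:\s+(?:src|href)=(?:"[^"]*"|\'[^\']*\'))*\s+(?!(?:href|src)=)\w+=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~u';

$input = '<script>var b="yes"</script>';

$looseResult  = preg_replace($loose, '$1', $input);
$strictResult = preg_replace($strict, '$1', $input);

echo $looseResult, "\n";  // <script>var b=yes</script>  (script content was edited)
echo $strictResult, "\n"; // <script>var b="yes"</script> (left intact)
```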
It is rather unwieldy, but this is the price you pay for such complex requirements: [`(?:<[a-z][\w:.-]*|(?!^)\G)(?:\s+(?:(?:src|href)=(?:"[^"]*"|'[^']*')|[a-z][\w:.-]*="(?:[^"=]*\s[^"=]*|[^"\s]*=[^"\s]*)"))*\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K(?|"([^\s"=]*)"|'([^\s'=]*)')`](https://regex101.com/r/nX9bS5/4). All exceptions must be matched in the same branch with `\G`. – Wiktor Stribiżew May 02 '16 at 20:48
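As a rough illustration of that last point, here is the pattern from the comment above as a PHP string. The `~` delimiters and the `ui` flags are assumptions carried over from the earlier revision in this thread, and the sample tag is made up. Attributes that must keep their quotes are consumed before `\K`, so the chain started by `\G` is not broken and later attributes in the same tag are still processed.

```php
<?php
// Delimiters and flags are assumed; the comment above gives only the bare pattern.
$re = '~(?:<[a-z][\w:.-]*|(?!^)\G)(?:\s+(?:(?:src|href)=(?:"[^"]*"|\'[^\']*\')|[a-z][\w:.-]*="(?:[^"=]*\s[^"=]*|[^"\s]*=[^"\s]*)"))*\s+(?!(?:href|src)=)[a-z][\w:.-]*=\K(?|"([^\s"=]*)"|\'([^\s\'=]*)\')~ui';

// Hypothetical input: hyphenated names, a value with a space, and a src attribute.
$html = '<my-tag data-x="a b" src="s.png" ng-if="ok" t=\'v\'>';

$result = preg_replace($re, '$1', $html);
echo $result, "\n"; // <my-tag data-x="a b" src="s.png" ng-if=ok t=v>
```

Note that, as written, the exception branch for values containing a space or an equal sign only covers double-quoted values.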
This is fantastic. I owe you several beers. – Tom May 02 '16 at 21:58
I hate to chime in 8 months later with an additional request, but I'm now dealing with meta tags with content definitions. As with other tags, no space characters or equal signs (=) means that the surrounding quotes may be stripped, but their presence means that the quotes must be left intact. But here's the wrinkle: if the definition begins with http:// or https://, then I need to retain the surrounding quotes. I've been trying to edit the regex above to accomplish this for several hours, but I'm coming up dry. Can you see an easy fix? Many thanks in advance. – Tom Jan 06 '17 at 18:28
Please provide some test cases showing what you need to do. – Wiktor Stribiżew Jan 06 '17 at 18:40
I think we're in luck, as it turns out that the forward slash character in a property value is grounds for the value to remain in quotes (double or single). Thus, both content="image/jpeg" and src="https://example.com" would retain their surrounding quotes, as all that needs to be checked for is the presence of a forward slash, which occurs in both. Also, the presence of a comma requires the value to retain its quotes as in content="noindex,nofollow". See before and after below; perhaps just the forward slash and the comma should be added to the list of characters to be searched for: – Tom Jan 06 '17 at 19:22
`code ` – Tom Jan 06 '17 at 19:22
Here: ` ` – Tom Jan 06 '17 at 19:29
Try [`(?:<[a-z][\w:.-]*|(?!^)\G)\s+(?:(?:href|src)=(['"]).*?\1\K|[a-z][\w.:-]*=(['"])https?:\/\/.*?\2\K|[a-z][\w:.-]*=\K(?|"([^\s"\/,=]*)"|'([^\s'\/,=]*)'))`](https://regex101.com/r/nX9bS5/5) to replace with `$3`. – Wiktor Stribiżew Jan 06 '17 at 20:12
Thanks, but there are apparently unescaped single quotes and other characters that prevent the PHP from being parsed (even though it works fine as-is in the online regex tester); my editor stops showing it as valid PHP code as soon as it encounters the "https" text (although the missing escapes may start earlier in the regex). – Tom Jan 06 '17 at 21:59
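One way around the escaping problem described in the comment above might be PHP's nowdoc syntax, which treats nothing inside it as an escape sequence, so the pattern can be pasted exactly as it appears in the regex tester. This is a minimal sketch: the `~` delimiters and `ui` flags are assumptions, and the sample markup is made up from the values mentioned in the comments. It has not been checked against every edge case.

```php
<?php
// Nowdoc: both quote characters and all backslashes are taken literally.
$re = <<<'RE'
~(?:<[a-z][\w:.-]*|(?!^)\G)\s+(?:(?:href|src)=(['"]).*?\1\K|[a-z][\w.:-]*=(['"])https?:\/\/.*?\2\K|[a-z][\w:.-]*=\K(?|"([^\s"\/,=]*)"|'([^\s'\/,=]*)'))~ui
RE;

// Hypothetical input built from the examples in the comments.
$html = '<meta name="robots" content="noindex,nofollow">'
      . '<img src="https://example.com/a.png" alt="logo" title="a b">';

// Group 3 holds the unquoted value; it is empty when a skip branch matched.
$result = preg_replace($re, '$3', $html);
echo $result, "\n";
// <meta name=robots content="noindex,nofollow"><img src="https://example.com/a.png" alt=logo title="a b">
```

If the pattern has to live in an ordinary single-quoted string instead, every `'` inside it needs to be written as `\'`.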

cdm · Answer 2 · 2016-05-02T17:10:43.167

0

Use this pattern: (<[^>]*?(?<!href)(?<!src)=)"((\p{L}|\d)+)"(.*?>) and replace each match with the 1st, 2nd and 4th capturing groups, re-running preg_replace for as long as replacements occur.

$a = '<aaa href="123ff" bbb="aaa">';
do {
  $b = preg_replace('/(<[^>]*?(?<!href)(?<!src)=)"((\\p{L}|\\d)+)"(.*?>)/u', '$1$2$4', $a, -1, $count);
  if(!$count) {
    break;
  }
  $a = $b;
}while(true);

edited May 02 '16 at 17:10

answered May 02 '16 at 07:51

cdm

Thanks, but this erroneously edits ``, as described in the first condition in my description. The loop suggests a long processing time. – Tom May 02 '16 at 14:18
I updated the regex. Now it should match things only in attributes. – cdm May 02 '16 at 17:12
