I need to convert the string `This <span style="font-size: 16px;" style="color: red;">is</span> a test.` to `This <span style="font-size: 16px; color: red;">is</span> a test.`

There could also be more than two matches, or a style, then a class, then another style, in which case the styles would still need to be combined. The tags won't always be spans, either.

Unfortunately, Tidy isn't an option, as it is more overbearing in its cleaning than this project can accommodate.

Going the DOMDocument route won't work: multiple style attributes on one tag aren't valid, so it only gets the contents of the first one.

I'd like to do it with preg_replace, but getting just the matches from one tag is proving to be quite difficult.

If it makes things easier, they start life as nested tags. I have a preg_replace that combines them from there and gives this output.

klenium
Arielle Lewis
  • [Don't use regex](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for it. – klenium Jul 23 '15 at 15:44
  • I think if your code is generating something like this, there may be something wrong with the existing code base. It would be better to fix that than to write another function that applies a fix on top of the broken code. I also agree with @klenium: don't use regex with HTML. – Liam Sorsby Jul 23 '15 at 15:45
  • I'd love to just use DOM Document, but most of what's happening is just clearing/fixing small strings that may or may not occur inside tags/attributes, so regex is the tool for the job. – Arielle Lewis Jul 23 '15 at 15:49
  • I agree with @LiamSorsby. You yourself shouldn't type `` in your HTML file, and if you use a framework that adds styles to the tag as `$tag->addStyle("")`, then you can fix it, or use a better class. Store the styles in an array, and when you have finished the work, join them all at once. You save a lot of time parsing, re-rendering, and validating the broken code, even if only 0.01s. Don't do extra work. – klenium Jul 23 '15 at 15:59
  • Adding to what @klenium said: regex is never the right tool for HTML parsing. Ever. There is always a better way to do things. Continually using regex in PHP will drastically slow down the site, as every time that page is generated it will run through that exact same situation. Your site will be slow and will also put a strain on your server. You would be better off fixing the class; if you don't, you will most certainly hit a brick wall with issues where your HTML is broken, behaves strangely, or is rendered invalid. – Liam Sorsby Jul 23 '15 at 16:03
  • @klenium, I'd never type that. This is cleanup of very messy old text/HTML. – Arielle Lewis Jul 23 '15 at 16:24
  • @LiamSorsby, like I said, this isn't just HTML that's being modified. In fact, most of it isn't. It's also a script that will be run a couple of times, with the cleaned results saved in a database, so speed/server stress is a non-factor. – Arielle Lewis Jul 23 '15 at 16:26

3 Answers

I agree with the comments above that the best solution is to prevent this situation in the first place, but to answer your question: this function combines all of the style attributes in the given string. Just make sure to pass only a single tag at a time. It doesn't matter how many other attributes are in the tag, nor does their order matter. It merges all of the style values into the first style attribute, then removes the other style attributes:

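/**
 * Merge every double-quoted style attribute of a single tag into the first one.
 *
 * @param string $str A single tag, e.g. '<span style="a" style="b">'
 * @return string
 */
function combineStyles($str)
{
    // Collect the value of every style="..." attribute in the tag.
    $found = preg_match_all("/style=\"([^\"]+)\"/", $str, $matches);
    if ($found)
    {
        // Strip whitespace and trailing semicolons so the join never yields ";;".
        $styles = array();
        foreach ($matches[1] as $style)
        {
            $style = rtrim(trim($style), "; \t");
            if ($style !== '')
            {
                $styles[] = $style;
            }
        }
        $combined = 'style="' . implode('; ', $styles) . ';"';
        // Replace the first attribute with the merged one and blank out the rest.
        $patterns = $matches[0];
        $replace = array_pad(array($combined), count($matches[0]), '');
        $str = str_replace($patterns, $replace, $str);
    }
    return $str;
}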
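Because the function expects one tag at a time, something has to split the input into tags first. The following is a minimal sketch of one way to do that with `preg_replace_callback`, not part of the original answer. `combineStylesInTags` is a hypothetical name, and the sketch assumes double-quoted attribute values that never contain `<` or `>`. It does its own merging rather than calling the function above, so that it runs on its own.

```php
<?php
// Hypothetical helper: merges the style attributes of every opening tag in $html.
// Assumes double-quoted values and no "<" or ">" inside attribute values.
function combineStylesInTags($html)
{
    // Visit each opening tag: "<", a letter, then everything up to the next ">".
    return preg_replace_callback('/<[a-zA-Z][^<>]*>/', function ($m) {
        $tag = $m[0];
        // The leading \s keeps attributes such as data-style from matching.
        if (preg_match_all('/\sstyle="([^"]*)"/', $tag, $styles) < 2) {
            return $tag; // zero or one style attribute: nothing to merge
        }
        // Normalise each value: trim it and drop trailing semicolons.
        $parts = array();
        foreach ($styles[1] as $value) {
            $value = rtrim(trim($value), "; \t");
            if ($value !== '') {
                $parts[] = $value;
            }
        }
        $merged = $parts ? ' style="' . implode('; ', $parts) . ';"' : '';
        // Put the merged attribute where the first one was and drop the others.
        $first = true;
        return preg_replace_callback('/\sstyle="[^"]*"/', function ($unused) use (&$first, $merged) {
            if ($first) {
                $first = false;
                return $merged;
            }
            return '';
        }, $tag);
    }, $html);
}

echo combineStylesInTags('This <span style="font-size: 16px;" class="x" style="color: red;">is</span> a test.');
// prints: This <span style="font-size: 16px; color: red;" class="x">is</span> a test.
```

Handling each tag inside the callback keeps a style attribute in one tag from ever being merged with one in a neighbouring tag, and attributes sitting between the styles are left where they were.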
Tony
  • I had to make a couple of little changes to what gets assigned to $combined (submitted an edit for your answer), but other than that it works perfectly. I agree that not being in this situation in the first place is ideal, but I'm working with 6+ year old text/HTML created by a combination of non-technical users, WYSIWYG editors, and Word. – Arielle Lewis Jul 23 '15 at 16:20
  • It eats a lot of memory. – klenium Jul 23 '15 at 16:24

Note: I've just realized this won't work when another attribute sits between the two style attributes, as in `style="" id="" style=""`.

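<?php
$str = 'This <span  style="font-size: 16px"  style="color: red;">is</span> a test. This <span  style="font-size: 16px;"  style="color: red;">is</span> a test.';

// Look for a closing quote, whitespace, then the start of a style attribute,
// and splice the two values together until no such junction remains.
// Caveat: this joins style="..." onto whatever double-quoted attribute
// directly precedes it, so it assumes that attribute is also a style.
while (preg_match('/"\s+style="/', $str, $matches))
{
    $pos = strpos($str, $matches[0]);   // offset of the junction
    $prev = substr($str, 0, $pos);      // everything before the closing quote
    // Make sure the first value ends with a semicolon before appending.
    if (substr(trim($prev), -1) != ";")
        $prev .= ";";
    $str = $prev.substr($str, $pos+strlen($matches[0]));
}

echo $str;
?>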
klenium

Using .NET regular expressions within Visual Studio 2012's Quick Replace, this expression worked for me:

Find:
style\s*=\s*(?<q2>['"])(?<w1>(?:(?!\k<q2>).)*?);?\k<q2>\s*(?<c>[^<>]*)\s*style\s*=\s*(?<q2>['"])(?<w2>(?:(?!\k<q2>).)*?);?\k<q2>

Replace:
style="${w1};${w2};" ${c}

Notes:
1. This will only merge two style attributes at a time. If a single tag has more than that, multiple runs will be required.
2. Any content between the two style attributes will be placed after the merged style attribute, which takes the position of the first style attribute.

Explanation

Find:

style           # match a style attribute
\s*             # match any optional white space
=               # match equals sign
\s*             # match any optional white space
(?<q2>['"])     # match either a single or double quote, stored in named capture 'q2'
(?<w1>          # start capture of first style attribute's content
(?:             # start non-capturing match
(?!\k<q2>)      # negative look-ahead to prevent matching on this attribute's quote
.)*?            # end non-capturing match with minimal, 0-many quantifier
)               # end capture of first style attribute's content
;?              # place trailing semi-colon (if present) outside the capture
\k<q2>          # match closing quote

\s*             # match any optional white space
(?<c>[^<>]*)    # capture content between style attributes
\s*             # match any optional white space

...             # repeat the above for a second style attribute
                #    except that the second style's capture is named 'w2'

Replacement:
style="         # start merged style attribute
${w1};          # place first style attribute's content
${w2};          # place second style attribute's content
"               # finish merge style attribute
 ${c}           # restore any content found between the two style attributes
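
The first note says repeated runs are needed when a tag carries more than two style attributes. As a rough sketch of that loop in PHP (the language of the question), the pattern could be ported to PCRE along these lines. This is an assumed port, not something tested in Visual Studio: `mergeStylePairs` is a hypothetical name, the two quote groups are named `q1` and `q2` because PCRE does not accept duplicate group names by default, and the content between the attributes is matched lazily so that pairs merge from left to right.

```php
<?php
// Hypothetical PCRE port of the Find/Replace pair above.
// Group numbers: 1 = q1, 2 = w1, 3 = c, 4 = q2, 5 = w2.
function mergeStylePairs($html)
{
    $pattern = '/style\s*=\s*(?<q1>[\'"])(?<w1>(?:(?!\k<q1>).)*?);?\k<q1>'
             . '\s*(?<c>[^<>]*?)\s*'
             . 'style\s*=\s*(?<q2>[\'"])(?<w2>(?:(?!\k<q2>).)*?);?\k<q2>/';
    // Each pass merges one pair per tag, so repeat until nothing changes.
    do {
        $html = preg_replace($pattern, 'style="${2};${5};" ${3}', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

echo mergeStylePairs('<p style="a: 1;" id="x" style=\'b: 2\' style="c: 3;">');
// prints: <p style="a: 1;b: 2;c: 3;" id="x">
```

As with the original expression, the merged attribute is always written with double quotes, so a single-quoted value that itself contains a double quote would need separate handling.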
Zarepheth