
I can't get this regex right, and I don't see what I'm missing. See the Regex101 example, or the breakdown below:

Regex

<span.*?font-weight:700.*?>(.*?)<\/span>

I'm trying to find every instance of span that contains font-weight:700.

<p><span style="color:#2c2c2c;font-weight:700;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Strong content</span></p><ul><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li></ul><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="font-size:10.5pt;color:#2c2c2c;font-weight:700">Should be bold</span><span 
style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p><p><span style="font-size:10.5pt;color:#2c2c2c;font-weight:700">Should be bold</span><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text </span></p><p><span style="font-size:10.5pt;color:#2c2c2c;font-weight:700">Should be bold</span><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p>

Get the content of that span and replace it with

<strong>$1</strong>

The problem is that this is my result:

<p><strong>Strong content</strong></p><ul><li><strong>Should be bold</strong><strong>Should be bold</strong><strong>Should be bold</strong><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p>

It cuts out all the list items, and removes "regular text" after match 2 and 3.

The expected output is:

<p><strong>Strong content</strong></p><ul><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li></ul><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><strong>Should be bold</strong><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p><p><strong>Should be bold</strong><span 
style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text </span></p><p><strong>Should be bold</strong><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p>
axelra82
    This is not a task that is very suitable for regex. Please see: [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – Blue Oct 27 '18 at 16:02

2 Answers


Swapping one element for another is covered in this thread: Replace Tag in HTML with DOMDocument. Here's an expanded version of that approach which only affects the span elements whose style attribute contains font-weight:700.

$html = '<p><span style="color:#2c2c2c;font-weight:700;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Strong content</span></p><ul><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li><li><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">list item</span></li></ul><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">Content text</span></p><p><span style="font-size:10.5pt;color:#2c2c2c;font-weight:700">Should be bold</span><span 
style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p><p><span style="font-size:10.5pt;color:#2c2c2c;font-weight:700">Should be bold</span><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text </span></p><p><span style="font-size:10.5pt;color:#2c2c2c;font-weight:700">Should be bold</span><span style="color:#2c2c2c;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:10.5pt;font-family:&quot;Arial&quot;;font-style:normal">: regular text</span></p>';
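$dom = new DOMDocument();
$dom->loadHTML($html);
$elements = $dom->getElementsByTagName('span');
// Walk the list backwards: it is live, so replacing nodes while counting up would skip elements.
for ($i = $elements->length - 1; $i >= 0; $i--) {
    if (preg_match('/font-weight:700/', $elements[$i]->getAttribute('style'))) {
        $span = $elements->item($i);
        // Copy the text into a text node so characters such as & are escaped properly.
        $strong = $dom->createElement('strong');
        $strong->appendChild($dom->createTextNode($span->textContent));
        $span->parentNode->replaceChild($strong, $span);
    }
}
echo $dom->saveHTML();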

https://3v4l.org/Y7Rua

The condition

if (preg_match('/font-weight:700/', $elements[$i]->getAttribute('style'))) {

could also be written with strpos. I'm guessing your real data might contain whitespace (for example font-weight: 700), which a regex can be extended to tolerate with \s*, so I went with the regex version.

if (strpos($elements[$i]->getAttribute('style'), 'font-weight:700') !== false) {

https://3v4l.org/uqWpj
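
To illustrate the whitespace concern, here is a minimal sketch comparing the two checks. The helper name isBoldStyle and the sample style strings are made up for this example and are not part of the code above; the \s* and \b additions are one possible way to loosen the pattern, not the only one.

```php
<?php
// Hypothetical helper (not from the answer above): returns true when a style
// attribute declares font-weight 700, tolerating spaces around the colon.
function isBoldStyle(string $style): bool
{
    // \b keeps values such as 7000 from being accepted.
    return preg_match('/font-weight\s*:\s*700\b/i', $style) === 1;
}

$compact = 'color:#2c2c2c;font-weight:700';
$spaced  = 'color: #2c2c2c; font-weight : 700;';

// strpos only finds the exact substring, so the spaced variant is missed.
$strposSpaced = strpos($spaced, 'font-weight:700') !== false;

var_dump($strposSpaced);
var_dump(isBoldStyle($compact));
var_dump(isBoldStyle($spaced));
var_dump(isBoldStyle('font-weight:400'));
```

If the styles are always written without spaces, as in the sample HTML, either check should behave the same.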

As for why your regex cuts more than you want: <span.*? starts matching at <span style="color:#2c2c2c;font-weight:400; and, because . also matches > and <, keeps going across the following tags until it finds font-weight:700. It then captures the content after that element, and everything in between is lost. This is why regex should not be used for parsing HTML: it is not aware of elements.
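
The over-matching can be reproduced on a much smaller input. The sketch below runs the original pattern next to a variant that uses [^>]* so the match cannot leave the opening tag. The shortened HTML and the variable names are assumptions for illustration, and the bounded variant is only a rough fix: it still breaks on nested spans or on a > inside an attribute value.

```php
<?php
// Shortened sample (made up for this sketch): one regular span, then one bold span.
$html = '<li><span style="font-weight:400">list item</span></li>'
      . '<p><span style="font-weight:700">Should be bold</span></p>';

// Original pattern: .*? is lazy but may still cross tag boundaries to reach font-weight:700.
$crossing = preg_replace('/<span.*?font-weight:700.*?>(.*?)<\/span>/', '<strong>$1</strong>', $html);

// Bounded variant: [^>]* stops at the end of the opening tag.
$bounded = preg_replace('/<span[^>]*font-weight:700[^>]*>(.*?)<\/span>/', '<strong>$1</strong>', $html);

echo $crossing, "\n"; // the list item span is swallowed by the match
echo $bounded, "\n";  // only the bold span is replaced
```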

user3783243

The reason your regex doesn't work is that some span tags don't contain font-weight:700.
For those tags, the .*? part keeps matching past the end of the tag until it reaches
a later span tag that does contain that font-weight.

This regex will restrict the match to a valid tag containing that font-weight.

Find:

/<span(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sstyle\s*=\s*(?:(['"])(?:(?!\1)[\S\s])*?font-weight:700(?:(?!\1)[\S\s])*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>([\S\s]*?)<\/span\s*>/

Replace: <strong>$2</strong>

https://regex101.com/r/o9qcHz/1
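
As a rough sketch of how the find and replace above might be applied in PHP, the pattern can be passed to preg_replace. The nowdoc wrapper, the variable names, and the shortened sample HTML are assumptions made for this example; the pattern itself is copied unchanged.

```php
<?php
// A nowdoc keeps the pattern verbatim, so its quotes and backslashes need no extra escaping.
$pattern = <<<'REGEX'
/<span(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sstyle\s*=\s*(?:(['"])(?:(?!\1)[\S\s])*?font-weight:700(?:(?!\1)[\S\s])*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>([\S\s]*?)<\/span\s*>/
REGEX;

// Shortened sample (made up for this sketch), mixing regular and bold spans.
$html = '<li><span style="color:#2c2c2c;font-weight:400">list item</span></li>'
      . '<p><span style="font-size:10.5pt;font-weight:700">Should be bold</span>'
      . '<span style="font-weight:400">: regular text</span></p>';

// Group 1 is the quote character, so the span content is group 2.
$result = preg_replace($pattern, '<strong>$2</strong>', $html);

echo $result, "\n";
```

On this sample only the span with font-weight:700 should be rewritten; the two font-weight:400 spans are left alone.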

More regex info:

 # Begin open Span tag

 < span
 (?= \s )
 (?=                    # Assertion (a pseudo atomic group)
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s style \s* = \s* 
      (?:
           ( ['"] )               # (1), Quote
           (?:
                (?! \1 )
                [\S\s] 
           )*?
           font-weight:700        # font weight 700
           (?:
                (?! \1 )
                [\S\s] 
           )*
           \1 
      )
 )
                        # Have the correct font-weight, just match the rest of tag
 \s+ 
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+

 >                      # End span tag

 ( [\S\s]*? )           # (2), span content
 </span \s* >           # Close span tag