php DOMDocument preg_replace fail detect

Question

Basically, I want to replace matching keyword tags in my content with hyperlinks. The replacement must only happen outside the caption/image/figure/figcaption/iframe/a elements of the existing content, because putting a hyperlink inside any of these breaks the formatting.

My PHP:

 $html_content= '税务调查。

[caption id="attachment_111" align="aligncenter" width="100"]<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登与儿子。" width="100" height="100" class="size-full wp-image" /> 拜登与儿子。[/caption]

他在声明中说：“我会非常认真地调查，往来。”

<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登总统" width="100" height="100" class="aligncenter size-full wp-image" />

<div style="position:relative; overflow:hidden"> <iframe src="https://cdn.google.com/players/VM.html" width="100" height="100" frameborder="0" scrolling="auto" title="大促销 拜登的美国" style="position:absolute;"></iframe> </div>

<iframe style="border: none; overflow: hidden;" src="https://www.facebook.com/plugins/video.php?height=100&amp;href=https%3A%2F%2Fwww.facebook.com;width=100&amp;t=0" width="100" height="100" frameborder="0" allowfullscreen="allowfullscreen"></iframe>

<iframe src="https://www.facebook.com/plugins/video.php?height=400&href=https%3A%2F%2&show_text=false&width=100&t=0" width="100" height="100" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowfullscreen="true" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture; web-share" allowFullScreen="true"></iframe>

<b>更多热点</b>

<p>halo拜登也指美国经济不会衰退</p>

<figure id="attachment_279" style="width: 100px" class="wp-caption alignnone"><img class="size-full wp-imag" src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="修理厂商总会拜登城" width="100" height="100" /><figcaption class="wp-caption-text">修理厂商总会拜登城</figcaption></figure>

<a href="http://google.com">go to google</a>

<span style="color: #ff6600;"><strong>另外，拜登声明中说</strong></span>';


function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
  if (!empty($dom->childNodes)) {
    foreach ($dom->childNodes as $node) {
        //echo $node->parentNode->nodeName . "<Br>";
      if ($node instanceof DOMText && !in_array($node->parentNode->nodeName, $excludeParents)) {
        $node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
      } 
      else{
        preg_replace_dom($regex, $replacement, $node, $excludeParents);
      }
    }
  }
}


$dom = new DOMDocument;
$internalErrors = libxml_use_internal_errors(true);
$dom->loadHTML( mb_convert_encoding($html_content, 'HTML-ENTITIES', "UTF-8"), LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED );


$tags = array("拜登","认真");
foreach($tags as $tag){
    $tagurl= '<span class="article-tag"><a class="mytag" href="http://outside.com" >'.$tag.'</a></span>';
    preg_replace_dom('/'.$tag.'/i', $tagurl, $dom->documentElement, array('a','image','iframe','figure','figcaption','caption'));

    $test_tag = '['.$tag.']';
    //preg_replace_dom('/'.$tag.'/i', $test_tag, $dom->documentElement, array('a','image','iframe','figure','figcaption','caption'));
}     


function getLink($tag){
    $arr = array(
        "拜登"=>"http://bai.com",
        "认真"=>"http://ren.com",      
        );
    return $arr[$tag];    
}

 $output = mb_substr($dom->saveHTML(), 0, null, "UTF-8");
//echo $output;
echo html_entity_decode($output);

Now I am facing 2 issues:

1. I want to exclude the [caption id=...] ... [/caption] block from the hyperlink replacement, but my regex fails to do so.

Currently it displays like this...

The DOMDocument loadHTML method also adds extra paragraph tags at seemingly random places. I could post-process the output by removing ALL paragraph tags, but then the final content would no longer match the original: some input content already contains paragraph tags, so this would strip the existing p tags too.

2. (solved) I want the preg_replace result to show as a clickable hyperlink in the browser, but echo $output shows the raw hyperlink markup as plain text, which cannot be clicked.

Update on issue 2: the value saved into $node->nodeValue is escaped, which turns the markup into plain text. I added echo html_entity_decode($output); to unescape it, and it now displays correctly.
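
The escaping in issue 2 can be reproduced in isolation. The following is a minimal sketch, not part of the original code: the input string is a hypothetical ASCII sample, and the fragment-based variant is one possible alternative to decoding the whole output. Note that html_entity_decode on the full document also decodes entities that were escaped on purpose, such as &amp; inside an iframe src.

```php
<?php
// Hypothetical minimal input; ASCII only to keep encoding out of the picture.
$dom = new DOMDocument();
$dom->loadHTML('<p>hello world</p>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$p = $dom->getElementsByTagName('p')->item(0);

// 1) Markup assigned to nodeValue is stored as text, so "<" is escaped on output.
$p->firstChild->nodeValue = 'hello <b>world</b>';
$escaped = $dom->saveHTML($p);
echo $escaped, "\n";

// 2) A document fragment parses the markup into real nodes instead.
$fragment = $dom->createDocumentFragment();
$fragment->appendXML('hello <b>world</b>'); // the string must be well-formed XML
$p->replaceChild($fragment, $p->firstChild);
$linked = $dom->saveHTML($p);
echo $linked, "\n";
```

The first echo shows the b tag as escaped text, while the second contains a real b element. The fragment approach only works when the replacement markup is well-formed XML.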

Desired output

 $output= '税务调查。

[caption id="attachment_111" align="aligncenter" width="100"]<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登与儿子。" width="100" height="100" class="size-full wp-image" /> 拜登与儿子。[/caption]

他在声明中说：“我会非常<span class="article-tag"><a class="mytag" href="http://outside.com" >认真</a></span>地调查，往来。”

<img src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="拜登总统" width="100" height="100" class="aligncenter size-full wp-image" />

<div style="position:relative; overflow:hidden"> <iframe src="https://cdn.google.com/players/VM.html" width="100" height="100" frameborder="0" scrolling="auto" title="大促销 拜登的美国" style="position:absolute;"></iframe> </div>

<iframe style="border: none; overflow: hidden;" src="https://www.facebook.com/plugins/video.php?height=100&amp;href=https%3A%2F%2Fwww.facebook.com;width=100&amp;t=0" width="100" height="100" frameborder="0" allowfullscreen="allowfullscreen"></iframe>

<iframe src="https://www.facebook.com/plugins/video.php?height=400&href=https%3A%2F%2&show_text=false&width=100&t=0" width="100" height="100" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowfullscreen="true" allow="autoplay; clipboard-write; encrypted-media; picture-in-picture; web-share" allowFullScreen="true"></iframe>

<b>更多热点</b>

<p>halo<span class="article-tag"><a class="mytag" href="http://outside.com" >拜登</a></span>也指美国经济不会衰退</p>

<figure id="attachment_279" style="width: 100px" class="wp-caption alignnone"><img class="size-full wp-imag" src="https://royaldesign.com/image/11/gubi-moon-dining-table-round-120-h73-3?w=168&quality=80" alt="修理厂商总会拜登城" width="100" height="100" /><figcaption class="wp-caption-text">修理厂商总会拜登城</figcaption></figure>

<a href="http://google.com">go to google</a>

<span style="color: #ff6600;"><strong>另外，<span class="article-tag"><a class="mytag" href="http://outside.com" >拜登</a></span>声明中说</strong></span>';

No use of XPath with DOMDocument? I usually pair these together for convenience. You can search my answers for "xpath" to see some examples. Please [edit] your question to show your exact desired result. Your question title is a little too vague to be searchable for future researchers needing the same type of resolution -- it may need refinement. — mickmackusa, Jul 28 '22 at 02:11
just attached an image of the problematic output for easy ref. — i need help, Jul 28 '22 at 02:50
Your content is large to view on my phone. It is taking me a while to scrutinize. If I can form a complete resolution, I'll post an answer. Here is half an answer: https://3v4l.org/sMjbU (`saveHTML($dom->documentElement)`) from https://stackoverflow.com/a/20675396/2943403 — mickmackusa, Jul 28 '22 at 03:11
How close is this? https://3v4l.org/4TEuR It is hard for me to spot flaws in the text on my tiny screen. — mickmackusa, Jul 28 '22 at 04:06
If you click the eyeball icon on the demo, it shows the substring within the figcaption as a blue link. I'll keep trying. — mickmackusa, Jul 28 '22 at 04:16
I have seen it, but I only want to change the text into a hyperlink when the match happens outside the figure/image/figcaption/caption/iframe boundaries. Although this word matches, it sits inside a figcaption, so it should not be changed. — i need help, Jul 28 '22 at 04:41
That doesn't look like HTML or XML but more like Markdown (Markdown can contain HTML). A pure DOM based solution can not handle the `[caption]` syntax. However, here is an example for matching and wrapping text in an HTML: https://stackoverflow.com/a/71295346/497139 — ThW, Jul 29 '22 at 08:10
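
For the HTML-only part of the problem, the DOMDocument+XPath pairing mentioned in the comments could look roughly like the sketch below. It is an illustration under assumptions, not the code behind any of the linked demos: the sample markup, the keyword foo, and the URL are placeholders, and, as noted above, it does nothing about the [caption] shortcode, which the DOM parser only sees as plain text.

```php
<?php
// Hypothetical sample; the keyword and URL are placeholders.
$html = '<div><p>foo bar</p><a href="#">foo</a>'
      . '<figure><figcaption>foo</figcaption></figure></div>';
$keyword = 'foo';

libxml_use_internal_errors(true); // silence warnings about HTML5 tags
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);

// Text nodes that contain the keyword and have no excluded ancestor.
$query = sprintf(
    '//text()[contains(., "%s")]'
    . '[not(ancestor::a or ancestor::figure or ancestor::figcaption or ancestor::iframe)]',
    $keyword
);

// Copy the node list first, because the loop modifies the tree.
foreach (iterator_to_array($xpath->query($query)) as $textNode) {
    // Isolate the first occurrence of the keyword in its own text node.
    $match = $textNode->splitText(mb_strpos($textNode->nodeValue, $keyword));
    $match->splitText(mb_strlen($keyword));

    // Swap the isolated text for a real <a> element, then move the text into it.
    $link = $dom->createElement('a');
    $link->setAttribute('href', 'http://outside.com');
    $match->parentNode->replaceChild($link, $match);
    $link->appendChild($match);
}

$result = $dom->saveHTML($dom->documentElement);
echo $result;
```

Because the link is built as an element rather than assigned as a string, nothing gets escaped. The sketch wraps only the first occurrence in each text node; wrapping every occurrence would need a loop over the remaining text.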

mickmackusa · Accepted Answer · 2022-07-28T22:00:38.307

1

I tried very, VERY hard to implement a DOMDocument+Xpath solution, but I came unstuck while trying to disqualify the text node within the square-tagged caption block. I couldn't manage to isolate the whole caption block to be able to exclude it. In the end, here is a caveman's regex approach to serve as a band-aid until someone smarter can solve this problem properly.

The regex matches the blacklisted tags in the text and discards them; it only replaces text that is not disqualified.

Code: (Demo)

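// Keywords to turn into links.
$tags = ["拜登", "认真"];
// Build one regex alternative per blacklisted element, e.g. <a ...>...</a>.
// img has no closing tag, so its alternative ends at "/>".
$blacklisted = implode(
    '|',
    array_map(
        fn($tag) => "<{$tag}[ >].+?" . ($tag === 'img' ? "/>" : "</$tag>"),
        ['a', 'img', 'iframe', 'figure', 'figcaption']
    )
);
// Blacklisted blocks, including [caption]...[/caption], are matched first and
// thrown away with (*SKIP)(*FAIL); only keywords outside them get wrapped.
// $html is assumed to hold the input markup ($html_content in the question).
echo preg_replace(
         sprintf('~(?:\[caption[ \]].+?\[/caption]|%s)(*SKIP)(*FAIL)|%s~us', $blacklisted, implode('|', $tags)),
         '<span class="article-tag"><a class="mytag" href="http://outside.com">$0</a></span>',
         $html
     );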
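
To see the skip-and-fail idea on its own, here is a reduced sketch with a hypothetical input string and a single keyword, foo. Only the a element and the caption shortcode are blacklisted here, and the replacement just brackets the match so the effect is easy to read.

```php
<?php
// Hypothetical input: "foo" appears outside and inside blacklisted blocks.
$text = 'foo [caption id="1"]foo[/caption] <a href="#">foo</a> foo';

// The first branch consumes a whole blacklisted block, then (*SKIP)(*FAIL)
// rejects that match and resumes searching after it. The second branch
// therefore only ever sees keywords outside those blocks.
$pattern = '~(?:\[caption[ \]].+?\[/caption]|<a[ >].+?</a>)(*SKIP)(*FAIL)|foo~us';

$result = preg_replace($pattern, '[$0]', $text);
echo $result; // [foo] [caption id="1"]foo[/caption] <a href="#">foo</a> [foo]
```

Only the first and last foo are bracketed; the two occurrences inside the caption block and the link are left untouched.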

edited Jul 28 '22 at 22:00

answered Jul 28 '22 at 17:04


Nice alternative :) I'm using PHP 7.3 and it gives the error: unexpected '=>' (T_DOUBLE_ARROW), expecting ')'. Is there any way to modify fn to work with this? – i need help Jul 29 '22 at 01:33
https://3v4l.org/bo5M0 – mickmackusa Jul 29 '22 at 01:35
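
The error comes from the arrow function syntax (fn), which was introduced in PHP 7.4. The sketch below shows one way to build the same blacklist with a classic anonymous function on PHP 7.3; it is an illustration and not necessarily what the linked demo contains.

```php
<?php
// Same blacklist as in the answer, written without fn() for PHP < 7.4.
$blacklisted = implode(
    '|',
    array_map(
        function ($tag) {
            // img has no closing tag, so its alternative ends at "/>".
            return "<{$tag}[ >].+?" . ($tag === 'img' ? '/>' : "</{$tag}>");
        },
        ['a', 'img', 'iframe', 'figure', 'figcaption']
    )
);
echo $blacklisted;
```

The resulting string can be passed to the same sprintf and preg_replace call as before.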
With this approach, how can I easily replace the outside.com href with a different URL for each match? I tried href="getLink($0)" to read from function getLink($tag){ $arr = array( "拜登"=>"http://bai.com", "认真"=>"http://ren.com", ); return $arr[$tag]; } but it didn't work. – i need help Jul 29 '22 at 02:42

https://3v4l.org/KALPY – mickmackusa Jul 29 '22 at 02:56
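
A function call written inside the replacement string of preg_replace is never executed; the string is used as-is, apart from references such as $0. One way to get a per-keyword href is preg_replace_callback with a lookup array. The sketch below assumes the keyword-to-URL map from getLink() in the question and a hypothetical short input; it is an illustration and may differ from the linked demo.

```php
<?php
// Keyword => URL map, taken from getLink() in the question.
$links = ['拜登' => 'http://bai.com', '认真' => 'http://ren.com'];

// Hypothetical short input for illustration.
$html = '<p>halo拜登</p><figcaption>拜登城</figcaption>我会非常认真地';

$blacklisted = implode('|', array_map(
    function ($tag) {
        return "<{$tag}[ >].+?" . ($tag === 'img' ? '/>' : "</{$tag}>");
    },
    ['a', 'img', 'iframe', 'figure', 'figcaption']
));
// Escape the keywords in case one ever contains a regex metacharacter.
$keywords = implode('|', array_map(
    function ($keyword) {
        return preg_quote($keyword, '~');
    },
    array_keys($links)
));
$pattern = sprintf(
    '~(?:\[caption[ \]].+?\[/caption]|%s)(*SKIP)(*FAIL)|%s~us',
    $blacklisted,
    $keywords
);

// The callback receives each match and looks up its own URL.
$result = preg_replace_callback(
    $pattern,
    function (array $m) use ($links) {
        return sprintf(
            '<span class="article-tag"><a class="mytag" href="%s">%s</a></span>',
            htmlspecialchars($links[$m[0]], ENT_QUOTES),
            $m[0]
        );
    },
    $html
);
echo $result;
```

The anonymous functions keep the sketch compatible with PHP 7.3. Each keyword outside the blacklisted blocks gets its own URL, and the one inside the figcaption is left alone.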
