0

This is the error I am trying to correct:

<img class="lazy_responsive" title="<a href='kathryn-kuhlman-language-en-topics-718-page-1' title='Kathryn Kuhlman'>Kathryn Kuhlman</a> - iUseFaith.com" src="ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="<a href='kathryn-kuhlman-language-en-topics-718-page-1' title='Kathryn Kuhlman'>Kathryn Kuhlman</a> - iUseFaith.com" width="1600" height="517">

If you look carefully at the code above, you will see that the text in the alt and title attributes was replaced with a link because the keyword appears in that text. As a result, the image's tooltip shows link markup instead of just a name.

Problem: I have an array with keywords where each keyword has its own URL which will serve as a link like this:

$keywords["Kathryn Kuhlman"] = "https://www.iusefaith.com/en-354";
$keywords["Max KANTCHEDE"] = "https://www.iusefaith.com/MaxKANTCHEDE";

I have a text with images and links ... where those keywords may be found.

$text='Meet God\'s General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE
';

I want to replace each keyword in the text with a full link to that keyword (including a title attribute), without touching the contents of any href, alt, or title attribute already in the text. This is what I tried:

$lien_existants = array();

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

if(preg_match_all("/$regexp/siU", $text, $matches, PREG_SET_ORDER)) 
{
    foreach($matches as $match) 
    {
        $lien_actuels_existant = filter_var($match[3], FILTER_SANITIZE_STRING);
        $lien_existants [] = trim($lien_actuels_existant);
          
        // $match[2] = link address
        // $match[3] = link text
        
        echo $match[2], '', $match[3], '<br>';
    }
}   

foreach(@$keywords as $name => $value) 
{
    if(!in_array($name, $lien_existants)&&!preg_match("/'/i", $name)&&!preg_match('/"/i', $name))
    {
        $text =  trim(preg_replace('~(\b'. $name.'\b)~ui', "<a href='$value' title='$name'>$1</a>", $text));
    }
    else
    {
        $name = addslashes($name);
        $text =  trim(preg_replace('~(\b'. $name.'\b)~ui', "<a href='$value' title='$name'>$1</a>", $text));
    }
    ######################################### 
}

This replaces the keywords with links, but it also replaces them inside the alt and title attributes of images.

How can I prevent it from replacing text inside the alt, title, and href attributes?

Note: I have tried all the other solutions I have found on S.O., so if you think one of them works, kindly apply it to my code above and show me how it should be done, because if I knew how to make it work I would not be asking here.

mickmackusa
John Max
  • Use the technique [described here](https://stackoverflow.com/a/23667869/3832970). – Wiktor Stribiżew Sep 21 '20 at 16:20
  • @WiktorStribiżew The answer you linked the question to does not meet my needs yet you close the question. Why is that happening on StackOverflow where we can not even ask a question and get a satisfying answer? Are we not here to exchange knowledge? You did not even get me a satisfactory answer. – John Max Sep 21 '20 at 16:36
  • The link you gave me when I use the same procedure, it does not even produce ANY link to ANY text – John Max Sep 21 '20 at 16:38
  • Edit the question to show how you are using `(*SKIP)(*FAIL)` – Wiktor Stribiżew Sep 21 '20 at 16:40
  • The whole reason why I am actually asking a question on SO is because I don't know the answer so before cutting my question off, you were supposed to answer and show me how to use it. If i knew about how to use ```(*SKIP)(*FAIL)```, will I be asking a question here ? Does it make sense ? – John Max Sep 21 '20 at 17:54
  • @WiktorStribiżew I have edited the question – John Max Sep 22 '20 at 17:45
  • I already gave you the link to the solution: match alt or title or whatever you need to keep, put `(*SKIP)(*FAIL)` after those/that alternative, and then use another alternative that will actually get matched. If that does not work, post the expression that you tried. Right now, you have not even tried to solve the problem you are describing. – Wiktor Stribiżew Sep 22 '20 at 18:47
  • I have read and reread the solution but I am not too good with regex so even though I have tried it, it did not work, can you please give me an answer so I may understand how practically this can be done? I have been on the page you gave me for days now. Tks – John Max Sep 24 '20 at 17:53
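For readers unfamiliar with the technique named in the comments above, here is a minimal sketch of how `(*SKIP)(*FAIL)` might be applied to this question's data. The function name `linkKeywords` and the shortened sample text are illustrative assumptions, not code from the thread. The pattern treats whole `<a>...</a>` elements and every other tag as spans to skip, which is a simplification that can break on unusual markup (for example a `>` inside an attribute value).

```php
<?php
// Hypothetical helper (not from the thread): link keywords that occur outside
// of tags and outside of existing <a>...</a> elements.
function linkKeywords(string $text, array $keywords): string
{
    // Escape each keyword so it is matched literally inside the pattern.
    $alternation = implode('|', array_map(
        function ($keyword) { return preg_quote($keyword, '~'); },
        array_keys($keywords)
    ));

    // Left side: an existing link or any single tag is matched, then discarded
    // by (*SKIP)(*FAIL), so nothing inside it can be matched again.
    // Right side: a keyword between word boundaries, captured as group 1.
    $pattern = '~(?:<a\b[^>]*>.*?</a>|<[^>]*>)(*SKIP)(*FAIL)|\b(' . $alternation . ')\b~is';

    // Lowercased keys allow a case-insensitive lookup of the matched keyword.
    $lookup = array_change_key_case($keywords, CASE_LOWER);

    return preg_replace_callback($pattern, function (array $m) use ($lookup) {
        $url = htmlspecialchars($lookup[strtolower($m[1])], ENT_QUOTES);
        // $m[1] comes straight from the HTML source, so it is reused unchanged.
        return '<a href="' . $url . '" title="' . $m[1] . '">' . $m[1] . '</a>';
    }, $text);
}

// Usage with a shortened version of the question's text.
$sampleKeywords = [
    'Kathryn Kuhlman' => 'https://www.iusefaith.com/en-354',
    'Max KANTCHEDE'   => 'https://www.iusefaith.com/MaxKANTCHEDE',
];
$sampleText = 'Meet Kathryn Kuhlman. <img title="Kathryn Kuhlman - iUseFaith.com" src="x.jpg" alt="Kathryn Kuhlman - iUseFaith.com"> '
            . 'Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a> and Max KANTCHEDE';
echo linkKeywords($sampleText, $sampleKeywords);
```

In this sketch the first and last keyword occurrences become links, while the `<img>` attributes and the existing link are left as they were. A DOM-based approach is generally more robust for arbitrary HTML.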

3 Answers

2

Regex is not the best way to deal with HTML content.

Here is a solution with DOM manipulation. The code should be self-explanatory with the comments provided.

The idea is to find all text nodes that are not children of a link or an image, then search for and replace the terms you want within those nodes only.

<?php
    
    $keywords["Kathryn Kuhlman"] = "https://www.iusefaith.com/en-354";
    $keywords["Max KANTCHEDE"] = "https://www.iusefaith.com/MaxKANTCHEDE";
    
    $text='Meet God\'s General Kathryn Kuhlman. <br>
    <img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
    <br>
    Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
    <br>
    Max KANTCHEDE
    ';
    
    
    // Format the replacement
    foreach($keywords as $name => &$value) {
        $value = '<a href="'.$value.'" title="'.$name.'">'.$name.'</a>';
    }
    // Break the reference so later code cannot modify the last element by accident
    unset($value);
    
    // Load a DomDocument with our html
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $text . '</body></html>');
    
    // Use XPath to find all text nodes whose parent is not an <img> or <a> element
    $xpath = new DOMXPath($doc);
    $textnodes = $xpath->query('//*[not(self::img or self::a)]/text()');
    
    // For each text node replace words found by the link
    foreach($textnodes as $textnode) {
        $html = str_replace(array_keys($keywords), array_values($keywords), $textnode->nodeValue, $count);
        if ($count) {
            $newelement = $doc->createDocumentFragment();
            $newelement->appendXML($html);
            $textnode->parentNode->replaceChild($newelement, $textnode);
        }
    }
    
    // Retrieve body html
    $body_element = $doc->getElementsByTagName('body');
    $body = $doc->savehtml($body_element->item(0));
    
    // Remove wrapping <body></body>
    echo substr($body, 6, strlen($body)-13);
     

You can use str_ireplace instead of str_replace for a case-insensitive search.
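To see which pieces of text the XPath query above actually selects, a small sketch like the following may help. The helper name `qualifyingTextNodes` and the shortened sample markup are assumptions made for illustration; it only lists the text nodes and does not perform any replacement.

```php
<?php
// Hypothetical helper: return the trimmed, non-empty text nodes whose parent
// element is neither <img> nor <a>.
function qualifyingTextNodes(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query('//*[not(self::img or self::a)]/text()') as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $found[] = $value;
        }
    }
    return $found;
}

// Attribute values and the text of the existing link are not listed.
print_r(qualifyingTextNodes(
    'Meet Kathryn Kuhlman. <br><img alt="Kathryn Kuhlman" src="x.jpg"><br>'
    . 'Follow <a href="/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a><br>Max KANTCHEDE'
));
```

Attribute values are never text nodes, which is why the alt and title contents are out of reach of the replacement by construction.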

Jiwoks
2

I think @Jiwoks' answer was on the right path with using dom parsing calls to isolate the qualifying text nodes.

While his answer works on the OP's sample data, I was unsatisfied to find that his solution failed when there was more than one string to be replaced in a single text node.

I've crafted my own solution with the goal of accommodating case-insensitive matching, word-boundary, multiple replacements in a text node, and fully qualified nodes being inserted (not merely new strings that look like child nodes).

Code: (Demo #1 with 2 replacements in a text node) (Demo #2: with OP's text)
(After receiving fuller, more realistic text from the OP: Demo #3 without trimming saveHTML())

$html = <<<HTML
Meet God's General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE & Kathryn Kuhlman
HTML;

$keywords = [
    'Kathryn Kuhlman' => 'https://www.example.com/en-354',
    'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
    'eneral' => 'https://www.example.com/this-is-not-used',
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$lookup = [];
$regexNeedles = [];
foreach ($keywords as $name => $link) {
    $lookup[strtolower($name)] = $link;
    $regexNeedles[] = preg_quote($name, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i' ;

foreach($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement('a');
            $a->setAttribute('href', $lookup[$fragmentLower]);
            $a->setAttribute('title', $fragment);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
echo substr(trim($dom->saveHTML()), 3, -4);

Output:

Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
<a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> &amp; <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>

Some explanatory points:

  • I am using some DomDocument silencing and flags because the sample input is missing a parent tag to contain all of the text. (There is nothing wrong with @Jiwoks' technique, this is just a different one -- choose whatever you like.)
  • A lookup array with lowercased keys is declared to allow case-insensitive translations on qualifying text.
  • A regex pattern is dynamically constructed, so each keyword should be passed through preg_quote() to ensure that the pattern logic is upheld. \b is a word boundary metacharacter to prevent matching a substring in a longer word. Notice that eneral is not replaced in General in the output. The case-insensitive flag i will allow greater flexibility for this application and future applications.
  • My xpath query is identical to @Jiwoks'; I see no reason to change it. It is seeking text nodes that are not the children of <img> or <a> tags.

...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings.

  • preg_split() is creating a flat, indexed array of non-empty substrings. Substrings which qualify for translation become their own elements, and any non-qualifying substrings between them become separate elements as well.

    • The final text node in my sample will generate 4 elements:

      0 => '
      ',                                 // non-qualifying newline
      1 => 'Max KANTCHEDE',              // translatable string
      2 => ' & ',                        // non-qualifying text
      3 => 'Kathryn Kuhlman'             // translatable string
      
  • For translatable strings, new <a> nodes are created and filled with the appropriate attributes and text, then pushed into a temporary array.

  • For non-translatable strings, text nodes are created, then pushed into a temporary array.

  • If any translations/replacements have been done, then dom is updated; otherwise, no mutation of the document is necessary.

  • In the end, the finalized html document is echoed, but because your sample input has some text that is not inside of tags, the temporary leading <p> and trailing </p> tag that DomDocument applied for stability must be removed to restore the structure to its original form. If all text is enclosed in tags, you can just use saveHTML() without any hacking at the string.
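The splitting step described in the points above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper name `splitOnKeywords`; it builds the same kind of pattern as the answer and applies it to the final text node of the sample.

```php
<?php
// Hypothetical helper: split a string into keyword and non-keyword fragments.
function splitOnKeywords(array $keywords, string $text): array
{
    // Escape every keyword so it is matched literally.
    $needles = array_map(
        function ($keyword) { return preg_quote($keyword, '~'); },
        $keywords
    );
    $pattern = '~\b(' . implode('|', $needles) . ')\b~i';

    // DELIM_CAPTURE keeps the matched keywords in the result;
    // NO_EMPTY drops the empty strings produced at the edges.
    return preg_split($pattern, $text, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
}

$fragments = splitOnKeywords(
    ['Kathryn Kuhlman', 'Max KANTCHEDE', 'eneral'],
    "\nMax KANTCHEDE & Kathryn Kuhlman"
);
var_export($fragments);
```

The result has the four elements listed above: the newline, the two keywords, and the `' & '` between them.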

mickmackusa
  • This is VERY good but it breaks when I put a lot of text and gives this kind of error ```

    “William J.Seymour, The Catalyst of Pentecost”

    ``` Please see a demo here https://www.iusefaith.com/brouillons.php and you will see the break at the first line. You can access the text I used to test it at https://www.iusefaith.com/brouillons_b.php Please Can you help me fix it? ?
    – John Max Sep 27 '20 at 17:47
  • I am at work now. I will revisit when I find time. Try without the trimming at the end (just `echo $dom->saveHTML()`). Your sample data in the question didn't have all text inside of tags. https://3v4l.org/e1mNK – mickmackusa Sep 27 '20 at 22:36
  • This worked but omitted words like ```Kathryn Kuhlman’s``` as in there were words that were not made a link. – John Max Sep 28 '20 at 12:48
  • Please provide another sample input which breaks my solution so that I can see for myself. I am actually disheartened that you selected a regex solution after Stack Overflow has been banging on for YEARs about how inappropriate regex is in parsing html. When regex answers to html parsing question are awarded the green tick researchers will be confused about the message that "regex should not be used to reliably parse html content". This link proves that my answer converts the name into a hyperlink: https://3v4l.org/QBEHm Feeling ripped off and disappointed that you are going to use a hack. – mickmackusa Sep 28 '20 at 13:44
  • So sorry for the disappointment, I wish I could just give you some of my rep in compensation for all the efforts. When I tested, there was one of the names that did not show, the last one. If there was a way for me to add you also 50 out of my rep, you can let me know and I will. My bad .. I did not know S.O recommended not to use regex, and the regex was not breaking when I tested it – John Max Sep 28 '20 at 15:07
  • 1
    https://stackoverflow.com/a/1732454/2943403 ...and regarding rep, I no longer care about rep because there is no longer a benefit once all privileges are unlocked. So don't bother blowing another bounty on me. I just want you to use reliable code in your application and researchers to find great, educational, professional, empowering content on this site. – mickmackusa Sep 28 '20 at 16:39
0

This is possible using regex by temporarily prepending a unique "marker string" before all keywords that you don't want to replace - see this regex101 demo and the following code:

// Define a marker string - could be anything that is very unlikely to appear in the
// text. (But don't include any characters that would need to be escaped in a regex).
$marker = '¬¦@#~';

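// Construct regex alternation syntax for all the keywords, escaping any regex
// metacharacters they contain. E.g: (Kathryn Kuhlman|Max KANTCHEDE|Another one)
$alt_keywords = '('.join('|', array_map(function ($k) { return preg_quote($k, '/'); }, array_keys($keywords))).')';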

// Double quotes: Prepend marker to keywords in href="...", alt="..." or title="..."
$text = preg_replace(
    '/((?:href|alt|title)\s*=\s*"[^"]*)'.$alt_keywords.'/',
    "$1$marker$2",
    $text);

// Single quotes: Prepend marker to keywords in href='...', alt='...' or title='...'
$text = preg_replace(
    "/((?:href|alt|title)\s*=\s*'[^']*)$alt_keywords/",
    "$1$marker$2",
    $text);

// Optional step - not explicitly requested in the question but seems necessary:
// Prepend marker to keywords found within anchor tags / end tags: <a>...</a>
// ($3 restores the rest of the link text and the closing </a> tag)
$text = preg_replace(
    "/(<a(?:\s+[^>]*)?>[^<]*)$alt_keywords([^<]*<\/a\s*>)/",
    "$1$marker$2$3",
    $text);

Negative lookbehind can then be used to only make replacements where the marker text isn't present - see this regex101 demo and the following code:

foreach($keywords as $name => $url) {
  // preg_quote() ensures the keyword is matched literally
  $text = preg_replace(
      "/(?<!$marker)".preg_quote($name, '/')."/",
      "<a href=\"$url\" title=\"$name\">$name</a>",
      $text);
}

// Now clean up by removing all instances of the marker text
$text = str_replace($marker, '', $text);
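The effect of the negative lookbehind can be checked on a tiny input. The sketch below uses a hypothetical ASCII marker `@@KEEP@@` and a hand-marked string rather than the marker and steps from the code above, and it omits the title attribute for brevity.

```php
<?php
// Hypothetical ASCII marker, chosen only for this illustration.
$marker = '@@KEEP@@';

// The first occurrence is "protected" by the marker; the second is not.
$text = 'alt="' . $marker . 'Kathryn Kuhlman" Meet Kathryn Kuhlman';

// (?<!...) succeeds only where the marker does NOT directly precede the keyword.
$pattern = '/(?<!' . preg_quote($marker, '/') . ')' . preg_quote('Kathryn Kuhlman', '/') . '/';
$linked = preg_replace($pattern, '<a href="https://www.iusefaith.com/en-354">$0</a>', $text);

// Remove the marker once the replacements are done.
$result = str_replace($marker, '', $linked);
echo $result;
```

Only the unmarked occurrence is wrapped in a link; the marked one inside the alt attribute comes out unchanged once the marker is stripped.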

Demo

This Rextester demo shows the code above working for the example values in the question.

Steve Chambers
  • Please justify why you are using regex to parse html. Please explain why you are using pattern modifiers `g` and `m`. Why does your answer use hardcoded values instead of respecting/using the OP's `$keywords` variable? The use of a random "marker" then mopping it up at the end seems an indirect/suboptimal/hackish solution to me. – mickmackusa Sep 25 '20 at 20:55
  • The question was tagged with regex - this answer is providing one possible solution that uses it. I won't comment on whether it's a hack/suboptimal as this seems a bit subjective - am just putting it out there as one possibility to pick from. The other points are valid - `g` and `m` weren't needed so have removed them and have also now incorporated the `$keywords` associative array. – Steve Chambers Sep 26 '20 at 20:55
  • 1
    Your updated answer was not tested. It breaks with a parsing error. – mickmackusa Sep 26 '20 at 22:48
  • Have now fixed the error and added a demo. Also added an additional (optional) step to no longer replace keywords that appear within anchor tags - e.g. `Kathryn Kuhlman`. – Steve Chambers Sep 27 '20 at 16:41
  • 1
    This is good but it removes the title from the link. The purpose was also to keep the title in the link without removing it. So every time a text is replaced by a link, a title should be added to the text in the link so when we mouse over the link it should display the content in a text. Can you please edit that? – John Max Sep 27 '20 at 17:25
  • OK - have now added the titles. – Steve Chambers Sep 27 '20 at 20:13