How to replace glossary terms in HTML text with links?

Question

I would like to run a str_replace or preg_replace which looks for certain words (found in $glossary_terms) in my $content and replaces them with links (like <a href="/glossary/initial/term">term</a>).

However, the $content is full HTML and my links/images are being affected too, which isn't what I'm after.

An example of $content is:

<div id="attachment_542" class="wp-caption alignleft" style="width: 135px"><a href="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a><p class="wp-caption-text">Amazonas Magazine - now in English!</p></div>
<p>Edited by Hans-Georg Evers, the magazine &#8216;Amazonas&#8217; has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it&#8217;s only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper&#8217;s Xmas list&#8230;</p>
<p>The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.</p>
<p>It&#8217;s fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.</p>
<p>U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!</p>
<p>Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>. Just gonna add this to the end of the post so I can do some testing.</p>

I came across this link, but I wasn't sure if such a method would work with nested HTML.

Is there any way I can str_replace or preg_replace content within <p> tags only; excluding any nested <a>, <img> or <h1/2/3/4/5> tags?

Thanks in advance,

possible duplicate of [str_replace within certain html tags only](http://stackoverflow.com/questions/3172493/str-replace-within-certain-html-tags-only) — Mark Baker, Feb 20 '12 at 09:45
A possible duplicate? I referenced that topic and stated, "wasn't sure if such a method would work with nested HTML". — turbonerd, Feb 20 '12 at 10:16
@dunc: Use `$xpath->query("//text()[not(parent::a) and contains(., '$glossary_term')]")` and you are all set. The `//` part takes care of the nesting. — Tomalak, Feb 20 '12 at 11:44
@dunc - Clearly you didn't read the linked answers properly, the accepted answer uses DomDocument and XPath to do the work, and you're strongly recommended not to even consider using str_replace or preg_replace — Mark Baker, Feb 20 '12 at 12:09
On the contrary I read the entire thread, especially the accepted answer. However, I've never come across such functions before and it wasn't clear to me whether or not they would do exactly what I needed. I also didn't think it was prudent or appropriate to bump a question from July 2010. — turbonerd, Feb 20 '12 at 12:12

Tomalak · Accepted Answer · 2012-02-23T15:30:50.380

A "by-the-book solution" would be like this:

<?php

$html = "<your HTML string>";
$glossary_terms = array('fishes', 'invertebrates', 'aquatic plants');

$dom = new DOMDocument;
$dom->loadHTML($html);

dom_link_glossary($dom, $glossary_terms);

echo $dom->saveHTML();

// wraps all occurrences of the glossary terms in links
function dom_link_glossary(&$document, &$glossary) {
  $xpath   = new DOMXPath($document);
  $urls    = array();
  $pattern = array();

  // build a normalized lookup (case-insensitive, whitespace-agnostic)
  foreach ($glossary as $term) {
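    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    if ($term_norm === '') continue;  // an empty term would produce empty matches
    // pass the delimiter so that terms containing "/" (e.g. "wet/dry filter") are escaped
    $pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm, '/'));
    $urls[$term_norm] = '/glossary/initial/' . rawurlencode($term);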
  }

  $pattern  = '/\b(' . implode('|', $pattern) . ')\b/i';
  $text_nodes = $xpath->query('//text()[not(ancestor::a)]');

  foreach($text_nodes as $original_node) {
    $text     = $original_node->nodeValue;
    $hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    if ($hitcount == 0) continue;

    $offset   = 0;
    $parent   = $original_node->parentNode;
    $refnode  = $original_node->nextSibling;

    $parent->removeChild($original_node);

    foreach ($matches[0] as $i => $match) {
      $term_txt = $match[0];
      $term_pos = $match[1];
      $term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));

      // insert any text before the term instance
      $prefix = substr($text, $offset, $term_pos - $offset);
      $parent->insertBefore($document->createTextNode($prefix), $refnode);

      // insert the actual term instance as a link
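      $link = $document->createElement("a");
      // a text node escapes special characters such as "&" in the term
      $link->appendChild($document->createTextNode($term_txt));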
      $link->setAttribute("href", $urls[$term_norm]);
      $parent->insertBefore($link, $refnode);

      $offset = $term_pos + strlen($term_txt);

      if ($i == $hitcount - 1) {  // last match, append remaining text
        $suffix = substr($text, $offset);
        $parent->insertBefore($document->createTextNode($suffix), $refnode);
      }
    }
  }
}
?>

Here is how dom_link_glossary() works:

1. It normalizes the glossary terms (trim, uppercase, whitespace) and builds a lookup array and a regex pattern that matches all terms. The pattern uses \b to prevent partial matches.
2. It uses XPath to find all text nodes that are not already part of a link. Text nodes are returned irrespective of their nesting depth (i.e. no recursion necessary on our part).
3. For each text node that contains terms:
   - The original text node is deleted ($parent->removeChild()).
   - New nodes are created and inserted into the DOM: text nodes for anything before (or after) a glossary term, element nodes (<a>) for the actual glossary terms.
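
To illustrate the XPath step, here is a minimal sketch that only lists the text nodes the expression selects. The sample HTML is made up for the illustration; the expression itself is the one used in the function above.

```php
<?php
// Made-up sample markup: one term inside a link, one nested in <em>.
$html = '<div><p>Some fishes and <a href="/x">linked fishes</a> in <em>aquatic plants</em>.</p></div>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect every text node that has no <a> ancestor, at any nesting depth.
$selected = array();
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    $selected[] = $node->nodeValue;
}

print_r($selected);
```

The text inside the existing link ("linked fishes") is skipped, while the text nested in <em> is still returned.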

The solution preserves original case and whitespace, therefore:

- term will become <a href="/glossary/initial/term">term</a>
- Term will become <a href="/glossary/initial/term">Term</a>
- Foo Bar will become <a href="/glossary/initial/foo%20bar">Foo Bar</a>

Surplus whitespace or line breaks in the HTML will not break the mechanism.
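
The normalization can be tried in isolation. The following sketch assumes a single glossary term 'foo bar' and a made-up input string. It shows that the match keeps the original case and whitespace while the normalized key still finds the URL.

```php
<?php
$term      = 'foo bar';  // assumed glossary entry
$term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));  // "FOO BAR"

// A space in the term becomes \s+ so that line breaks or runs of spaces still match.
$pattern = '/\b(' . preg_replace('/ /', '\\s+', preg_quote($term_norm, '/')) . ')\b/i';
$url     = '/glossary/initial/' . rawurlencode($term);

// Made-up text with different case and a line break inside the term.
$text = "A Foo\n  Bar appears here.";
preg_match($pattern, $text, $m);

$matched = $m[0];                                          // text as written in the document
$key     = preg_replace('/\s+/', ' ', strtoupper($matched)); // normalized lookup key

var_dump($matched, $key, $url);
```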

Note that it is perfectly all right to use regex on the plain text node values. It is not okay to use regex on full HTML.
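
A small sketch of the difference, using a made-up snippet modelled on the question's content: a naive replacement on the raw HTML also rewrites the title attribute, whereas matching on text node values only sees the visible text.

```php
<?php
// Made-up markup: the term occurs in an attribute and in the visible text.
$html = '<p><img title="Amazonas English" src="a.jpg" alt=""> The Amazonas magazine</p>';
$link = '<a href="/glossary/a/amazonas">Amazonas</a>';

// Naive approach: replaces both occurrences and corrupts the title attribute.
$naive = str_replace('Amazonas', $link, $html);

// DOM approach: the regex only ever sees text node values, never attributes.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hits = 0;
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
    $hits += preg_match_all('/\bAmazonas\b/i', $node->nodeValue);
}

echo $naive, "\n";
echo "matches in text nodes: ", $hits, "\n";
```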

I would recommend pairing the glossary terms with their respective URLs in an array, instead of calculating the URLs in the function. That way you can make multiple terms point to the same URL.
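
One way this pairing might look is sketched below. The terms and URLs are invented for the illustration, and the lookup is built the same way as in the function above, except that the URL comes from the array instead of being calculated.

```php
<?php
// Hypothetical glossary: two spellings point to the same page.
$glossary = array(
    'fish'           => '/glossary/f/fish',
    'fishes'         => '/glossary/f/fish',
    'aquatic plants' => '/glossary/a/aquatic-plants',
);

$urls  = array();
$parts = array();
foreach ($glossary as $term => $url) {
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    if ($term_norm === '') continue;  // skip empty terms
    $parts[] = preg_replace('/ /', '\\s+', preg_quote($term_norm, '/'));
    $urls[$term_norm] = $url;         // URL taken from the array
}
$pattern = '/\b(' . implode('|', $parts) . ')\b/i';

// Resolve each hit in a made-up sentence to its URL.
$text = 'Many Fishes eat aquatic   plants, one fish at a time.';
preg_match_all($pattern, $text, $matches);

$resolved = array();
foreach ($matches[0] as $hit) {
    $key = preg_replace('/\s+/', ' ', strtoupper($hit));
    $resolved[$hit] = $urls[$key];
}

print_r($resolved);
```

Inside dom_link_glossary() only the loop that fills $urls would need to change; the rest of the function can stay as it is.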

Sorry Tomalak - for some reason I haven't seen this post until I logged on to write another question. I'll be trying this tonight, many thanks. — turbonerd, Feb 27 '12 at 11:12
Hi Tomalak. I've implemented the script but it's linking every single space: `All species in`. Any ideas? — turbonerd, Feb 28 '12 at 01:27
@dunc I've tested the function and it definitely is not doing that for me. -- If you look closely, it is not actually linking the spaces. Looking into my crystal ball: Can it be that your `$glossary_terms` contains the empty string? — Tomalak, Feb 28 '12 at 06:49
That's exactly what I thought @Tomalak but the full array being used for the wordlist (well, a cut down sample of what I want to use, but I still can't see any issues with it) is this: http://pastebin.com/wNPby2U3 — turbonerd, Feb 28 '12 at 10:53
OK, scrap that. I was doing something strange which stopped your code from working - my apologies. I've spent the past hour or so trying to work out why your code would work with your glossary terms but not mine - the `wet/dry filter` seems to be the problem! :) Fixed now, thank you so very much for your help. — turbonerd, Feb 28 '12 at 11:52
Hi, Tomalak, I am trying to make it work in my case. I will just do text_1 to link_1 replacement one string at a time in a loop a html_string. for example, i will make "AAA BBB CCC XXX YY Z" to "AAA BBB CCC XXX YY Z". I don't know DOM and having trouble make it work. — likeforex.com, Jul 22 '12 at 20:59
@likeforex I'm sorry to hear that but by "it doesn't work" alone it is impossible to help you. Also it seems you are either not even using the above code or are not understanding it at all. **Tip #1:** *Never* copy code off the Internet that you do not fully understand. **Tip 2:** Do a search on "PHP links replace" (or similar) - there are literally thousands of examples on this site alone. Find a simpler one to start with if the above is too complex. **Tip 3:** If nothing else helps, ask a new question. Give actual code examples and the error you get. — Tomalak, Jul 23 '12 at 00:33
i read and understood what you did. what i tried is: I want to do a single term to a single link replacement when the term is not within a link itself via: dom_link_glossary($dom, $term, $url); the hard part is to know whether the term is inside a html link tag or text. — likeforex.com, Jul 23 '12 at 02:45
@likeforex If you understand the code, then all you need to add is a `if ($original_node->parentNode->nodeName !== 'A')` check in the foreach loop. — Tomalak, Jul 23 '12 at 03:03
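
For the single-term case raised in the comments above, a reduced variant could look like the following sketch. The function name dom_link_term() and the sample markup are hypothetical and not part of the answer. The XPath predicate not(ancestor::a) already excludes text inside links, so this sketch needs no separate parentNode check.

```php
<?php
// Hypothetical helper: link one term to one URL, skipping text inside <a>.
function dom_link_term(DOMDocument $document, $term, $url) {
    $xpath   = new DOMXPath($document);
    $pattern = '/\b(' . preg_quote($term, '/') . ')\b/i';

    // Take a snapshot of the matching text nodes before changing the tree.
    $nodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($nodes as $node) {
        // With DELIM_CAPTURE the matched terms end up at the odd indexes.
        $pieces = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($pieces) < 2) continue;  // no occurrence in this node

        $fragment = $document->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($i % 2 === 1) {
                $link = $document->createElement('a');
                $link->setAttribute('href', $url);
                $link->appendChild($document->createTextNode($piece));
                $fragment->appendChild($link);
            } elseif ($piece !== '') {
                $fragment->appendChild($document->createTextNode($piece));
            }
        }
        // Swap the original text node for the text/link sequence.
        $node->parentNode->replaceChild($fragment, $node);
    }
}

$dom = new DOMDocument;
$dom->loadHTML('<p>Tetra care: see <a href="/t">tetra list</a>.</p>');
dom_link_term($dom, 'tetra', '/glossary/t/tetra');

$result = $dom->saveHTML($dom->getElementsByTagName('p')->item(0));
echo $result, "\n";
```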

score 0 · Answer 2 · answered Feb 20 '12 at 09:56

You can try this:

$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '', $content);

Sabbir Ahmed Chowdhury

Can you explain that line and what it does exactly? I'm not great with regex. – turbonerd Feb 20 '12 at 10:02
well, as you can see, inside the preg_replace function, the first parameter is the pattern for the `<p class="wp-caption-text">` tag. Between the opening and closing tags there is `[^<]+`, which means anything between the tags. You will also see `/` before the pattern and `/i` after it, which define its start and end. Then in the second parameter, there is an empty string to replace anything between the tags with (here you can set your own string to replace). I guess that should help. – Sabbir Ahmed Chowdhury Feb 20 '12 at 10:13
@dunc: You should never use regex to mess with HTML. *Especially* when you are not great with regex. – Tomalak Feb 20 '12 at 11:46
Yeah I'd prefer not to Tomalak - thanks for replying to my original question. If you'd like to offer it as an answer I'd happily give you "the tick" :) Also, if you have any links on using `$xpath->query` that would be very helpful. – turbonerd Feb 20 '12 at 11:48
