
How do I ignore HTML tags in this preg_replace? I run it inside a foreach loop for a search, so if someone searches for "apple span", the preg_replace also wraps the tag name span in a highlight span and the HTML breaks:

preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);

Thanks in advance!

hakre
Fabian

1 Answer


I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Although regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.

The general saying is: Don't parse HTML with regular expressions.

It's a good rule to keep in mind. As with any rule, it does not always apply, but it is worth making up your own mind about it.

XPath allows you to find all texts that contain the search terms within text only, ignoring all XML elements.

Then you only need to wrap those texts in a <span> and you're done.
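As a minimal sketch of that idea, limited to the simple case where the keyword sits entirely inside one text node: the function name highlightInTextNodes and the sample markup are made up for illustration, the mbstring extension is assumed to be available, and the keyword is assumed to contain no " character. It only uses DOMXPath::query, DOMText::splitText, replaceChild and appendChild.

```php
<?php
// Hypothetical helper (not from the answer): wraps every occurrence of
// $keyword that lies within a single text node in a highlight <span>.
// Assumes $keyword contains no double quote and that mbstring is loaded.
function highlightInTextNodes(DOMDocument $doc, string $keyword): void
{
    if ($keyword === '') {
        return; // nothing to search for
    }
    $xp = new DOMXPath($doc);
    // Select text nodes only, so tag names and attributes are never matched.
    $found = $xp->query('//text()[contains(., "' . $keyword . '")]');
    foreach (iterator_to_array($found) as $text) {
        while (($pos = mb_strpos($text->data, $keyword, 0, 'UTF-8')) !== false) {
            // Cut the text node into: before | match | rest
            $match = $text->splitText($pos);
            $rest  = $match->splitText(mb_strlen($keyword, 'UTF-8'));

            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);

            $text = $rest; // keep searching in the remainder
        }
    }
}

$doc = new DOMDocument;
$doc->loadXML('<p>An apple in a <span>span</span>, another apple.</p>');
highlightInTextNodes($doc, 'span');
echo $doc->saveXML($doc->documentElement), "\n";
```

Searching for "span" here only touches the text inside the existing <span> element, not the tag itself. The search is case-sensitive, and a keyword that spans several text nodes is not found by this sketch.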

Edit: Finally some code ;)

First, it uses XPath to locate elements that contain the search text. My query looks like this (it could probably be written better; I'm not an XPath pro):

'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'

$search contains the text to search for and must not contain any " (double quote) character, as that would break the query (see "Cleaning/sanitizing xpath attributes" for a workaround if you need quotes).
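One common way around the quote restriction, sketched here under assumptions: XPath 1.0 string literals have no escape character, so a value containing both quote types is usually assembled with concat(). The function name xpathLiteral is hypothetical, and this is not necessarily the approach the linked question recommends.

```php
<?php
// Hypothetical helper: turns an arbitrary string into an XPath 1.0
// string literal expression that can be placed inside a query.
function xpathLiteral(string $value): string
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';   // no double quote: wrap in double quotes
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";   // no single quote: wrap in single quotes
    }
    // Both quote types present: split on " and glue the pieces back
    // together with concat(), passing each " as its own '"' literal.
    $parts = array_map(
        function ($part) {
            return '"' . $part . '"';
        },
        explode('"', $value)
    );
    return 'concat(' . implode(", '\"', ", $parts) . ')';
}

$search = 'say "hi" to O\'Neil';
echo '//*[contains(., ' . xpathLiteral($search) . ')]', "\n";
```

With such a helper, the query string would be built from xpathLiteral($search) instead of '"'.$search.'"'.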

This query returns all parents that contain text nodes which, put together, form a string that contains your search term.

As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.

This is the base skeleton of the routine:

$str = '...'; # some XML

$search = 'text that span';

printf("Searching for: (%d) '%s'\n", strlen($search), $search);

$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);

$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
    throw new Exception('Anchor element not found.');
}

// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
    throw new Exception('XPath failed.');
}

// process search results
foreach ($r as $node)
{
    $textNodes = $xp->query('.//child::text()', $node);

    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);
    $ranges = array();
    while (FALSE !== $start = strpos($range, $search))
    {
        $base = $range->split($start);
        $range = $base->split(strlen($search));
        $ranges[] = $base;
    }

    // wrap each matching textnode in a highlight span
    // (distinct variable names avoid overwriting $range and $node from above)
    foreach ($ranges as $match)
    {
        foreach ($match->getNodes() as $textNode)
        {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $textNode->parentNode->replaceChild($span, $textNode);
            $span->appendChild($textNode);
        }
    }
}

For my example XML:

<html>
    <body>
        This is some <span>text</span> that span across a page to search in.
    and more text that span</body>
</html>

It produces the following result:

<html>
    <body>
        This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
    and more <span class="search_hightlight">text that span</span></body>
</html>

This shows that the approach can even find text that is distributed across multiple tags. That is not easy to do with regular expressions.

You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class, which I have left out of the answer's example).

It does not work properly on the viper codepad because of the older LIBXML version that site uses. It works fine with my LIBXML version 20707. I created a related question about this issue: XPath query result order.

A note of warning: This example uses binary string search (strpos) and the resulting byte offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.

The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
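To illustrate where the two offsets diverge, here is a small sketch with made-up sample text, assuming the source file is saved as UTF-8 and the mbstring extension is available:

```php
<?php
// Sample data is illustrative only. "ü" and "ß" take two bytes each in
// UTF-8, so byte offsets and character offsets differ after them.
$text   = 'Füße und Äpfel';
$needle = 'Äpfel';

$byteOffset = strpos($text, $needle);                 // counts bytes
$charOffset = mb_strpos($text, $needle, 0, 'UTF-8');  // counts characters

printf("strpos: %d, mb_strpos: %d\n", $byteOffset, $charOffset);
// prints "strpos: 11, mb_strpos: 9"
```

Passing the byte offset 11 to DOMText::splitText would cut the node two characters too late, which is why the character offset is the one to use.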

In a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:

 while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
hakre
  • +1 for using tried and true tools for html parsing/manipulation – Rob Apodaca Nov 19 '11 at 12:15
  • thank you for your explanation. i really appreciate this! i would be glad if you could include an example, but i'll also read into DOMDocument and DOMXPath myself. thanks! – Fabian Nov 19 '11 at 17:03
  • @Fabian: I'm having an example running on my machine but couldn't get it to run on an online codepad because of some differences. I try to figure out how to work around that in another question I just was ready to post: http://stackoverflow.com/q/8195733/367456 - I'll post the code here if I find a solution for the problem so it's proper. The code so far is here: http://codepad.viper-7.com/U4bxbe - however it doesn't work for the first text that goes over the `` while it works on my dev-box. – hakre Nov 19 '11 at 17:11
  • thank you for your efforts. i'll look into your example and will read some tutorials to understand everything as this is new to me. i'll also follow you question, but it seems that it's an version bug from codepad, right? – Fabian Nov 19 '11 at 17:52
  • @Fabian: I updated the answer with some more information and an explanation what I did for the solution. Hope this is helpful. – hakre Nov 19 '11 at 18:10
  • @Fabian: Yes, it's a bug with the XMLLIB version on codepad. I'm currently looking for more authorative information about that issue. – hakre Nov 19 '11 at 18:22
  • wow, thank you so much for this detailed answer! i'll try to recode my search with xpath. it looks much more cleaner than with regular expressions. – Fabian Nov 19 '11 at 18:33
  • Codepad Viper is down, the source-code of the TextRange class is here as well: https://gist.github.com/gists/1894360/ – hakre Feb 23 '12 at 19:07
  • I had to use the following for the XPath, otherwise it wasn't finding matching nodes without children: "//*[contains(., '$search')]/*[FALSE = contains(., '$search')]/..|//*[contains(., '$search') and count(*)=0]" – David Alan Hjelle May 03 '12 at 19:19
  • TextRange class is available at: https://gist.github.com/hakre/1894360 – jfaron Oct 01 '20 at 16:29