1

I'm trying to solve this bug in Drupal's Hashtags module: http://drupal.org/node/1718154

I've got this function that matches every word in my text that is prefixed by "#", like #tag:

function hashtags_get_tags($text) {
    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
    }

I need to ignore internal links in pages, such as <a href="#reference">link</a>, or, more generally, any word prefixed by # that appears inside an HTML tag (that is, preceded by < and followed by >).

Any idea how I can achieve this?

gerlos
  • Obligatory warning: You'll run into trouble trying to match HTML using regex. For matching hashtags in a limited set of HTML on small amounts of text, I'm guessing the worst-case scenario is probably mangled-looking content. But it's easy to get this wrong, and it's easy to introduce security problems when using regex on HTML. Be very, very careful. – Carson Myers Aug 08 '12 at 02:43
  • Someone always links to this: [Parsing HTML with RegEx](http://stackoverflow.com/a/1732454/1421049). – uınbɐɥs Aug 08 '12 at 03:33
  • Actually, I think I can restrict my requirements: most of the time I want to ignore "hashtags" inside `<a>` tags... – gerlos Aug 08 '12 at 03:34

3 Answers

1

Can you strip the tags first, before matching (using the strip_tags function)?

function hashtags_get_tags($text) {

    $text = strip_tags($text);

    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
}

A regular expression is going to be tricky if you want to only match hashtags that are not inside an HTML tag.
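
For what it's worth, here is a rough sketch of what such a regex could look like, in case stripping the tags is not an option. It adds a negative lookahead to the original pattern so that a hashtag is skipped when the next angle bracket after it is a closing ">". The function name hashtags_get_tags_outside_markup is made up for this example, and the approach assumes reasonably well-formed markup: a literal ">" in plain text after a hashtag (as in "#a > b") would make it skip that hashtag too.

```php
<?php
// Hypothetical variant of hashtags_get_tags(); the name is invented for this sketch.
function hashtags_get_tags_outside_markup($text) {
    // Match #word only if it is NOT followed by "...>" with no "<" or ">" in between,
    // i.e. only if the match does not sit inside a tag such as <a href="#reference">.
    $pattern = '/#[0-9A-Za-z_]+(?![^<>]*>)/';
    $matches = array();
    preg_match_all($pattern, $text, $matches);
    return implode(',', $matches[0]);
}

// The href value is ignored, the two hashtags in the text are kept.
echo hashtags_get_tags_outside_markup('See <a href="#reference">link</a> and #drupal, #php');
```

Note that this still picks up hashtags in the visible text of a link, since that text is outside the angle brackets.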

Jon Lin
0

You could throw out the tags beforehand using preg_replace:

function hashtags_get_tags($text) {
    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    // Drop everything between < and > before looking for hashtags.
    $text = preg_replace("/<[^>]*>/", "", $text);
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
}
lafuzz
0

I made this function using PHP DOM.

It returns all links that do not have a # anywhere in the href.

If you want it to remove only internal links (those whose href starts with #), replace this line:

if(strpos($link->getAttribute('href'), '#') === false) {

with this:

if(strpos($link->getAttribute('href'), '#') !== 0) {

This is the function:

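function no_hashtags($text) {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($text);
    libxml_clear_errors();
    $links = $doc->getElementsByTagName('a');
    $nohashes = array();
    foreach($links as $link) {
        // Keep only links whose href contains no "#" at all.
        if(strpos($link->getAttribute('href'), '#') === false) {
            // Copy the link into a scratch document to serialize just that element.
            $temp = new DOMDocument();
            $elem = $temp->importNode($link->cloneNode(true), true);
            $temp->appendChild($elem);
            $nohashes[] = $temp->saveHTML();
        }
    }
    // Returns the kept links as one HTML string. Alternatives: return $nohashes
    // for an array, or implode(',', $nohashes) for a comma-separated list.
    return implode('', $nohashes);
}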
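
The same DOM idea can also be pointed at the original problem. Below is a minimal sketch, assuming the goal is to collect hashtags from visible text while ignoring anything inside links. The helper name hashtags_from_text_nodes is made up and is not part of the Hashtags module. It relies on the fact that attribute values such as href="#reference" are never text nodes, and it uses an XPath filter to skip text inside <a> elements.

```php
<?php
// Hypothetical helper, not part of the Hashtags module: collect hashtags from
// text nodes only, skipping any text that sits inside an <a> element.
function hashtags_from_text_nodes($html) {
    // Collect parser warnings quietly, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrapping in a <div> is an assumption of this sketch, so that plain text
    // fragments and empty input are still parsed as HTML.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = array();
    // Every text node that has no <a> ancestor, in document order.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        if (preg_match_all('/#[0-9A-Za-z_]+/', $node->nodeValue, $m)) {
            $tags = array_merge($tags, $m[0]);
        }
    }
    return implode(',', $tags);
}

echo hashtags_from_text_nodes('See <a href="#reference">#inlink</a> and #drupal, <b>#php</b>');
```

The pattern is the same ASCII-only one used in the question, so the sketch does not try to handle encoding of non-ASCII text.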
uınbɐɥs