
I have a forum that supports hashtags. I'm using the following line to convert all hashtags into links. I'm using the (^|\(|\s|>) pattern to avoid picking up named anchors in URLs.

$str=preg_replace("/(^|\(|\s|>)(#(\w+))/","$1<a href=\"/smalltalk.php?Tag=$3&amp;".SID."\">$2</a>",$str);

I'm using the line below to pick up hashtags and store them in a separate field when the user posts their message. It picks up all hashtags EXCEPT those at the start of a new line.

preg_match_all("/(^|\(|\s|>)(#(\w+))/",$Content,$Matches);

Adding the m and s modifiers doesn't make any difference. What am I doing wrong in the second instance?

Edit: the input text could be plain text or HTML. Example of problem input:

#startoftextreplacesandmatches #afterwhitespacereplacesandmatches <b>#insidehtmltagreplacesandmatches</b> :)
#startofnewlinereplacesbutdoesnotmatch :(
biegleux
Orinoco
  • The hashtags are placed into which kind of text? Plain text? HTML? BBCode? Markdown? Letters carved into stone plates? – hakre Sep 02 '12 at 15:21
  • The text could be plain text or HTML – Orinoco Sep 02 '12 at 15:53
  • In case of HTML, I suggest you should take care about hashtags in a text like: `a #tag this is` which would be the hashtag `#tag` probably. If so (or some of the other common things that could happen), you might be interested in this question and answer: [Ignore html tags in preg_replace](http://stackoverflow.com/q/8193327) – hakre Sep 02 '12 at 16:00

1 Answer


Your replace operation has a problem that you have evidently not come across yet: it lets unescaped HTML special characters through. I know this because your regex allows hashtags to be prefixed with >, which is itself an HTML special character.

For that reason, I recommend you use the code below to do the replacement. It doubles as the code for extracting the tags to be inserted into the database:

$hashtags = array();

$expr = '/(?:(?:(^|[(>\s])#(\w+))|(?P<notag>.+?))/';

$str = preg_replace_callback($expr, function($matches) use (&$hashtags) {
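    // Compare against '' instead of using empty(): empty('0') is true in PHP,
    // so a literal "0" in the text would otherwise be treated as a hashtag
    if (isset($matches['notag']) && $matches['notag'] !== '') {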
        // This takes care of HTML special characters outside hashtags
        return htmlspecialchars($matches['notag']);
    } else {
        // Handle hashtags
        $hashtags[] = $matches[2];
        return htmlspecialchars($matches[1]).'<a href="/smalltalk.php?Tag='.htmlspecialchars(urlencode($matches[2])).'&amp;'.SID.'">#'.htmlspecialchars($matches[2]).'</a>';
    }
}, $str);

After the above code has been run, $str will contain the modified string, properly escaped for direct output, and $hashtags will be populated with all the tags matched.
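The code above uses a closure, which needs PHP 5.3 or later. On older versions, one possible workaround is to pass an object method as the callback instead. The following is only a sketch under assumptions: the class name `HashtagLinker`, the sample input, and the placeholder session string `sid=abc` are made up for illustration and are not part of the original code. It also checks `notag` against `''` rather than with `empty()`, because `empty('0')` is true in PHP.

```php
<?php
// Hypothetical helper class (name and structure are assumptions for illustration).
class HashtagLinker
{
    public $hashtags = array();
    private $sid;

    public function __construct($sid)
    {
        $this->sid = $sid; // in the forum code this would be the SID constant
    }

    // Callback for preg_replace_callback: escapes plain text, links hashtags.
    public function replace($matches)
    {
        if (isset($matches['notag']) && $matches['notag'] !== '') {
            // Text outside hashtags: escape HTML special characters
            return htmlspecialchars($matches['notag']);
        }
        // Hashtag: remember it and turn it into a link
        $this->hashtags[] = $matches[2];
        return htmlspecialchars($matches[1])
            . '<a href="/smalltalk.php?Tag=' . htmlspecialchars(urlencode($matches[2]))
            . '&amp;' . $this->sid . '">#' . htmlspecialchars($matches[2]) . '</a>';
    }
}

$expr = '/(?:(?:(^|[(>\s])#(\w+))|(?P<notag>.+?))/';

$linker = new HashtagLinker('sid=abc'); // placeholder session string
$str = "#one (#two)\n#three & 0";       // made-up sample input
$out = preg_replace_callback($expr, array($linker, 'replace'), $str);

print_r($linker->hashtags); // the tags collected during the replacement
echo $out, "\n";
```

With this sample input, `$linker->hashtags` should hold `one`, `two` and `three`, including the tag that follows the line break, and the stray `&` should come out as `&amp;`.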


DaveRandom
  • Tried to test but got Parse error: syntax error, unexpected T_FUNCTION in /blahblahblah/smallpost.php on line 180. The input text could already be HTML from TinyEditor, I've already stripped unwanted tags & escaped necessary characters before I hit this code. The preg_replace is working, it is the preg_match_all that is not. – Orinoco Sep 02 '12 at 16:04
  • @user1641839 Well in that case the same principle applies, it just means it can be simplified: http://codepad.viper-7.com/jfmDvG – DaveRandom Sep 02 '12 at 16:08
  • @user1641839 The parse error is because you are on PHP < 5.3, one sec I'll give you a working version for that – DaveRandom Sep 02 '12 at 16:08
  • @user1641839 OK, how about this? http://codepad.viper-7.com/D1G7hr - I had to wrap it in an object because PHP 5.2 does not have a mechanism for inheriting variables from the current scope. I know your existing replace code is working, but my point is that by combining it with the tag search you can solve the problem you have with that, and also make your code more efficient because you only have to run the regex once. – DaveRandom Sep 02 '12 at 16:21
  • In working through this code I discovered the real problem: the match was being performed after the text had been escaped for insertion via SQL, which was breaking the linebreaks. The replace had no such escaping because it was being performed on text coming out of the table, doh! You're right that your code is more efficient though, so I will be making use of it. Thank you for your help. – Orinoco Sep 02 '12 at 17:26
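
The cause described in the last comment can be reproduced in a few lines. This is a sketch under one assumption: `str_replace` stands in for the SQL escaping step, turning each real newline into the two characters `\` and `n`, which is what escaping functions for SQL string literals typically do. The pattern is the one from the question, and the sample text is made up.

```php
<?php
// Pattern from the question, unchanged.
$pattern = "/(^|\(|\s|>)(#(\w+))/";

// Made-up sample text: one tag at the start, one at the start of a new line.
$raw = "#first text\n#second";

// Stand-in for SQL escaping (assumption): a real newline becomes
// a literal backslash followed by the letter n.
$escaped = str_replace("\n", "\\n", $raw);

preg_match_all($pattern, $raw, $rawMatches);
preg_match_all($pattern, $escaped, $escapedMatches);

print_r($rawMatches[3]);     // first, second
print_r($escapedMatches[3]); // first only
```

In the escaped string, `#second` is preceded by the letter `n` rather than by whitespace, so the `(^|\(|\s|>)` prefix cannot match. Running the match on the text before it is escaped for SQL avoids the problem, and no m or s modifier is needed because `\s` already matches a newline.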