Stripping input to complete plain text

Question

I'm currently finalising the code for my comment system, and I want it to work a little like Stack Overflow's posts. I would like my users to be able to use bold, italic and underline only, and for that I would use the following:

_ Text _ * BOLD * -Italic-

Firstly, I would like to know a way of stripping a comment completely clean of any tags, HTML entities and such, so that, for example, any HTML/PHP tags a user enters are removed from the input.

I am currently using strip_tags, but that can leave the output looking quite nasty. Even if an abusive or blatant XSS/injection attempt has been made, I would still like the plain text to be output in full and not chopped up; strip_tags seems to make an absolute mess of that.

What I will then do is replace the asterisks with bold HTML tags, and so on, AFTER stripping the content clean of HTML tags.

How do people suggest I do this? Currently this is the comment sanitize function:

function cleanNonSQL( $str )
{
    return strip_tags( stripslashes( trim( $str ) ) );
}

Use `htmlspecialchars` without fail before outputting user-provided content and it's all good (there can be no XSS). There are a few fine points (make sure the encoding of the text matches what you tell `htmlspecialchars`, if you are putting user text inside HTML attribute values you have to pay attention to the second parameter) but that's basically it. — Jon, Mar 04 '12 at 13:32

score 1 · Answer 1 · answered Mar 04 '12 at 13:31

You could try using regular expressions to strip the tags, such as:

preg_replace("/\<(.+?)\>/", '', $str);

Not sure if that's what you're looking for, but it will remove anything inside < and >. You can also make it a little more foolproof by requiring the first character after the < to be a letter.
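
A minimal sketch of the "first character must be a letter" refinement mentioned above. The function name stripTagLike is hypothetical, and the pattern only approximates what a tag looks like; it is not an HTML parser, and the result would still need escaping before output because stray < characters are left in place.

```php
<?php
// Hypothetical helper: removes only sequences that look like HTML tags,
// i.e. "<", an optional "/", a letter, then anything up to the next ">".
function stripTagLike(string $str): string
{
    return preg_replace('/<\/?[a-z][^>]*>/i', '', $str);
}

// The <b> tags are removed; the ":<" smiley is not followed by a letter,
// so it is left alone.
echo stripTagLike('a <b>bold</b> word and a smiley :< that stays :>'), "\n";
```

Note that under this pattern a PHP open tag such as <?php is not matched either, since ? is not a letter, so it would have to be handled separately.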

answered Mar 04 '12 at 13:31

Ynhockey

Thank you for this, it has helped! – Jake Ball Mar 04 '12 at 14:11

score 1 · Accepted Answer · answered Mar 04 '12 at 13:32

PHP tags are surrounded by <? and ?>, or maybe <% and %> on some ages-old installations, so removing PHP tags can be managed by a regex:

$cleaned=preg_replace('/\<\?.*?\?\>/', '', $dirty);
$cleaned=preg_replace('/\<\%.*?\%\>/', '', $cleaned);

Next you take care of the HTML tags: these are surrounded by < and >. Again you can do this with a regex:

$cleaned=preg_replace('/\<.*?\>/','',$cleaned);

This will transform

$dirty='blah blah blah <?php echo $this; ?> foo foo foo <some> html <tag> and <another /> bar bar';

into

$cleaned="blah blah blah  foo foo foo  html  and  bar bar";

answered Mar 04 '12 at 13:32

Eugen Rieck

Thank you for that, and I assume the trim() function will then take care of the whitespace upon the output? – Jake Ball Mar 04 '12 at 13:40
I didn't touch `trim()` for a reason: Assuming you want (just as in SO) to have lines starting with whitespace to have a special meaning (now or in a future revision), you might simply not want to trim. Only you have the info to balance the perceived need for trimming against current or future problems. – Eugen Rieck Mar 04 '12 at 13:47
Thank you for this, I did not need to use the trim function as you said. I will now learn the syntax of preg_replace and try to output text surround by an asterisk as bold, etc. Thanks – Jake Ball Mar 04 '12 at 14:02
Start with `$bold_decoded=preg_replace('/\*(\w.*?)\*/','<b>$1</b>',$cleaned);` – Eugen Rieck Mar 04 '12 at 14:07
And of course `$ul_decoded=preg_replace('/_(\w.*?)_/','<u>$1</u>',$bold_decoded);` and `$italic_decoded=preg_replace('/-(\w.*?)-/','<i>$1</i>',$ul_decoded);` – Eugen Rieck Mar 04 '12 at 14:10
Thank you for that, however the only issue with that is it does not understand when multiple tags have been used i.e -_*content*_- Is it possible to change that? Also, what does `(\w.*?)` mean in the syntax? – Jake Ball Mar 04 '12 at 14:13
My bad: use `preg_replace_all()` instead of `preg_replace()`. `(\w.*?)` means: Match a word character (\w), then anything (.*) but not greedy (?) – Eugen Rieck Mar 04 '12 at 14:15
I don't believe preg_replace_all() is a php function, did you mean preg_match_all() ? Thanks for helping with the syntax, I now understand that – Jake Ball Mar 04 '12 at 14:24
OK, forget `preg_replace_all()`. I misread, you meant multiple formats on one phrase. Since you only have three possible formats, I'd just regex all combinations, e.g. `preg_replace('/_-(\w.*?)-_/','<u><i>$1</i></u>',$txt)` – Eugen Rieck Mar 04 '12 at 14:33
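
As an alternative to enumerating every combination, here is a rough sketch that escapes the text first and then applies the three markers one after another. The function name formatComment and the exact patterns are assumptions for illustration, not something from the thread. Because the wrapped text only has to start and end with a non-whitespace character (rather than \w), a later pass can wrap the tags produced by an earlier one, which is what lets nested markers such as -_*content*_- work.

```php
<?php
// Hypothetical helper: escape first, then turn *bold*, _underline_ and
// -italic- markers into tags. Names and patterns are illustrative only.
function formatComment(string $raw): string
{
    // Escape everything so user-supplied markup is displayed, not interpreted.
    $text = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // marker => tag. The inserted tags and the entities produced above
    // contain none of the marker characters, so later passes cannot
    // match inside earlier output.
    $rules = ['*' => 'b', '_' => 'u', '-' => 'i'];

    foreach ($rules as $marker => $tag) {
        $m = preg_quote($marker, '/');
        // The marker must not touch a letter or digit on its outer side
        // (so "well-known" is left alone), and the wrapped text must
        // start and end with a non-whitespace character.
        $pattern = '/(?<![A-Za-z0-9])' . $m . '(\S(?:.*?\S)?)' . $m . '(?![A-Za-z0-9])/';
        $text = preg_replace($pattern, "<$tag>\$1</$tag>", $text);
    }

    return $text;
}

echo formatComment('-_*content*_-'), "\n";     // <i><u><b>content</b></u></i>
echo formatComment('<b>hi</b> *there*'), "\n"; // &lt;b&gt;hi&lt;/b&gt; <b>there</b>
```

The markers do not span line breaks in this sketch, since . does not match a newline by default.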

Basti · Answer 3 · 2012-03-04T14:08:47.110

The correct way is not to delete HTML tags from your user's comment, but to tell the browser that the following text should not be interpreted as HTML, JavaScript, or anything else. Imagine someone wants to post example code like we do here on Stack Overflow. If you just bluntly remove any parts of a comment that seem to be code, you will mess up the user's comment.

The solution is to use htmlentities, which will escape the symbols used for HTML markup in the comment so that it actually shows up as just text in the browser.

For example, the browser will interpret a < as the beginning of an HTML tag. If you just want the browser to display a <, you have to write &lt; in the source code. htmlentities will convert all the relevant symbols into their HTML entities for you.

Longer Example

echo htmlentities("<b>this text should not be bold</b><?php echo PHP_SELF;?>");

Outputs

&lt;b&gt;this text should not be bold&lt;/b&gt;&lt;?php echo PHP_SELF;?&gt;

The browser will output

<b>this text should not be bold</b><?php echo PHP_SELF;?>

Consider the following real-life example with the solution you accepted. Imagine a user writing this comment:

i'm in a bad mood today :<. but your blog made me really happy :>

You will now do your preg_replace("/\<(.+?)\>/", '', $comment); on the text and it will remove half the comment:

i'm in a bad mood today :

If that's what you wanted, never mind this answer. If it isn't, use htmlentities.
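
A small side-by-side sketch of the two approaches on the comment above. It uses htmlspecialchars (as suggested in the comment on the question) rather than htmlentities; both escape < and >, and the ENT_QUOTES and 'UTF-8' arguments are assumptions about the page's setup.

```php
<?php
$comment = "i'm in a bad mood today :<. but your blog made me really happy :>";

// Stripping: everything from the first "<" to the next ">" is lost.
$stripped = preg_replace('/<(.+?)>/', '', $comment);

// Escaping: nothing is lost; the browser renders the original characters.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n"; // i'm in a bad mood today :
echo $escaped, "\n";  // i&#039;m in a bad mood today :&lt;. but your blog made me really happy :&gt;
```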

If you want to save the comment as a file and not have the server interpret PHP code inside it, save it with an extension like '.html' or '.txt', so that the web server won't call the PHP interpreter in the first place. There is usually no need to escape PHP code.

edited Mar 04 '12 at 14:08

answered Mar 04 '12 at 13:34

Basti

Thank you for your input, however the comment system has no real need for users to submit code snippets, it is to simply comment on other users uploads and images, or comment on the site related news. Thanks! – Jake Ball Mar 04 '12 at 13:39
That is not the issue. Even if the users don't post HTML code or any other code, they still use symbols that the browser interprets as HTML. You generally don't want to delete those symbols, as they may have a different meaning in the comment. If you just blindly delete those symbols, you might end up deleting parts of smileys or mathematical equations or URLs. That will really mess up the comments. Escaping is the way to go. You are providing HTML code to the browser. If you don't want the user's comment to be interpreted as HTML code, escape it. – Basti Mar 04 '12 at 13:44
added some more explanation to this issue. hopefully you will now understand the issue. – Basti Mar 04 '12 at 14:14
This works, yes; however, I simply want it all stripped to plain text, and then for the user to be able to use BBCode-style tags, if you could call them that, which will then be replaced with bold/underline and italic. I already remove links that aren't in my whitelist database to avoid abuse, and there are no smileys currently part of this comment system, so again not a problem there. I understand how preg_replace will remove entire code blocks; however, I do not want the use of tags on the client side, so they will simply use underscore for underlined text etc., therefore eliminating the need to use tags. Thanks! – Jake Ball Mar 04 '12 at 14:19
ok you still do not understand the problem, but i ran out of motivation to explain. good luck. – Basti Mar 04 '12 at 14:28
I think I understand what you are trying to say, are you saying that the use of symbols such as `<: :>` will be truncated alongside all text inside? – Jake Ball Mar 04 '12 at 14:30
yes. that happens in my 2nd example. any symbol that is relevant in html will be interpreted by the browser. but many of those symbols are actually used by people when writing comments, texts, whatever. so you have to tell the browser not to interpret a symbol in a user's comment but to take it literally. see http://www.w3schools.com/Html/html_entities.asp this is also relevant if you want to output valid html code. – Basti Mar 04 '12 at 14:33
i looked up a related question explaining why the approach to remove html code with regular expression won't work. in short: regular expressions only match regular languages. html is not regular, but context sensitive. http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg – Basti Mar 04 '12 at 14:40
