Output HTML safely using PHP

Question

I have used Stack Overflow to find solutions to my problems, so until now I haven't needed to post a question. I searched for a way to output HTML code, and as many of you have answered, HTMLPurifier is the best solution around.

I find it hard to believe that this is the only way. Shouldn't PHP itself provide a way to clean input of XSS attacks while still outputting the data as HTML?

htmlentities, htmlspecialchars, and strip_tags are not the best candidates for this.

So, the question is: What is?

What I am trying to do is output users' HTML data from MySQL safely.

@afuzzyllama : What I am trying to do is to output user's HTML data from MYSQL safely. — ExoVillaro, Aug 28 '11 at 03:10
Define "safely". You mean you want to clean it of certain tags? You want to *escape* it? — deceze, Aug 28 '11 at 03:12
Typically, you should sufficiently sanitize *input* data rather than *output* data. — adlawson, Aug 28 '11 at 03:13
`strip_tags('')`. http://php.net/manual/en/function.strip-tags.php — adlawson, Aug 28 '11 at 03:25
@adlawson : Are you serious ? your site will be attacked in the next 5 mins — ExoVillaro, Aug 28 '11 at 13:29
@exovillaro You asked me "how to sanitize input data with tags like — adlawson, Aug 28 '11 at 13:34

Bailey Parker · Answer 1 · 2011-08-28T03:19:58.613

htmlentities works just fine in many cases. However, I believe the best method to prevent things like XSS is whitelisting acceptable characters. For example:

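A person's name can have uppercase and lowercase letters, spaces, hyphens, and possibly apostrophes. So full names entered into your system must match the regex /^[a-z' -]+$/i (the hyphen goes last in the character class so that it is read as a literal hyphen rather than as a range).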
Examples: Henry Smith, John O'Neil, Heather Fischer-Gardener.

An email address can contain uppercase and lowercase letters A-Z, numbers, pluses, dashes, periods, and the at symbol. So the regex for the email would be: /^[a-z0-9.+-]+@[a-z0-9.-]+$/i.
Examples: jeff.Atwood@stackoverflow.com, spammer123@yahoo.com, leet+hacker@gmail.com, php-list@php.net

You can expand this to fit any data input. Just think about what characters could be typed. The best part about this system is that you can allow inputs that match the regexes and record inputs that don't. You can then look at the log of blocked inputs and see whether you need to adjust the regexes to allow valid characters, or block users who are attempting to circumvent your security measures.
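As a rough illustration of the allow-and-log idea described above, here is a minimal sketch. The helper name `isAllowedName` and the `$rejected` array (standing in for a real log) are hypothetical and not part of the original answer. The hyphen is placed last in the character class so that it is treated as a literal character.

```php
<?php
// Hypothetical helper: returns true when $name matches the whitelist;
// otherwise records the rejected value in $rejected and returns false.
function isAllowedName(string $name, array &$rejected): bool
{
    // Letters, apostrophes, spaces and hyphens only (case-insensitive).
    if (preg_match("/^[a-z' -]+$/i", $name) === 1) {
        return true;
    }
    $rejected[] = $name; // stand-in for writing to a real log
    return false;
}

$rejected = [];
$results = [];
$candidates = [
    "Henry Smith",
    "John O'Neil",
    "Heather Fischer-Gardener",
    "<script>alert(1)</script>",
];
foreach ($candidates as $candidate) {
    $results[$candidate] = isAllowedName($candidate, $rejected);
}

var_dump($results);  // which inputs were accepted
var_dump($rejected); // which inputs would be reviewed later
```

Reviewing `$rejected` later is what lets you decide whether to widen the pattern or treat the input as an attack attempt.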

edited Aug 28 '11 at 03:19

answered Aug 28 '11 at 03:12


That's only for the extraordinarily narrow situation when you expect pure English input. "A person's name can have only alphanumeric characters + spaces, hyphen and apostrophe"? Really? – deceze Aug 28 '11 at 03:17
Your email regex would be true for `+@9`. **Do not** validate email with regex. http://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/201378#201378 – adlawson Aug 28 '11 at 03:18
@deceze This isn't intended for broad inputs, only for limited inputs (usually text fields) where only certain characters are acceptable. And no, I'm sure there are more characters that can be used in valid names. The beauty of this system is that it logs what it doesn't accept, so you can teach it more valid inputs. For example, if a user tried a name with diacritics, you would learn that they needed to be added as valid input. – Bailey Parker Aug 28 '11 at 03:23
@adlawson I wasn't intending for my email regex to be used in a real situation. There are better regexes out there for validating email, but as you mention they might not be the way to go. You could use `filter_var()` and `FILTER_VALIDATE_EMAIL`. – Bailey Parker Aug 28 '11 at 03:24
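For reference, the `filter_var()` approach mentioned in this comment could look roughly like the sketch below. The sample addresses and the `$valid` array are illustrative only. `filter_var()` returns the filtered value on success and `false` on failure, so the result is compared strictly against `false`.

```php
<?php
$candidates = ['php-list@php.net', 'leet+hacker@gmail.com', 'not an email'];

$valid = [];
foreach ($candidates as $candidate) {
    // true when PHP's built-in validator accepts the address
    $valid[$candidate] = filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump($valid);
```

This only checks the syntax of the address; it does not confirm that the mailbox exists.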
I find regex to be vulnerable. – ExoVillaro Aug 28 '11 at 03:25
@ExoVillaro Then use htmlentities. It will serve its purpose. – Bailey Parker Aug 28 '11 at 03:26
This is going to be *extremely* tedious if you want to support anything but English names. Seriously. There are hundreds of European characters with diacritics. Hundreds more for non-Latin alphabets. Thousands upon thousands of characters in Asian scripts. Whitelisting is only viable for *extraordinarily narrow situations*. Human names are not that narrow. – deceze Aug 28 '11 at 03:27
The problem I am trying to address here is the knee-jerk reactions we have to vulnerabilities. Maybe my suggestion is not the best approach, but if a vulnerability were discovered tomorrow, chances are a restrictive regex would prevent it before PHP could update its sanitization functions. – Bailey Parker Aug 28 '11 at 03:31
@Php Fair enough. Your system may be secure-ish, but it's impractical for most applications. May as well not accept any input at all → perfect security, just not very useful. Your suggestion is too close to that end of the spectrum and has real problems scaling towards the usable-but-still-secure end. – deceze Aug 28 '11 at 03:38
@deceze Agreed. I think I might be taking too much of the overprotective mother approach. My method might need to allow the user a little more input freedom. – Bailey Parker Aug 28 '11 at 03:42
htmlentities WILL NOT OUTPUT HTML. My question is HOW TO OUTPUT HTML. – ExoVillaro Aug 28 '11 at 13:29
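A small sketch of the behaviour being objected to here: escaping with `htmlspecialchars()` (used in place of `htmlentities()` for brevity) makes the browser display the markup as text instead of rendering it. The sample string is made up for illustration.

```php
<?php
$userHtml = '<b>Hello</b> <script>alert("xss")</script>';

// Every tag is converted to entities, harmless ones included.
$escaped = htmlspecialchars($userHtml, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL;
// prints: &lt;b&gt;Hello&lt;/b&gt; &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

The script is neutralized, but so is the `<b>` tag, which is why escaping alone does not answer the question of how to output user HTML as HTML.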
@ExoVillaro Well that makes things slightly more complicated. Malicious JS could be embedded in a script tag or any of the inline DOM events on attributes. You could use an XML parser and remove any script tag and any on attributes (such as onload, onclick, onkeyup) from user input. However, if you just want to allow the user to input formatted text (bold, italics, etc) consider using a format system (BBCode or maybe something similar to what SO uses). This way the user won't be entering actual HTML that could contain XSS or other bad things. – Bailey Parker Aug 28 '11 at 15:06
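The parser-based idea in this comment might be sketched as follows, using PHP's `DOMDocument` HTML parser rather than a strict XML parser. The function name `stripScripts` is hypothetical. This is a deliberately incomplete illustration: it removes only `<script>` elements and `on*` attributes, and it does not handle other vectors such as `javascript:` URLs, `<iframe>`, or `<style>`, so it should not be treated as a replacement for a dedicated library like the HTMLPurifier mentioned in the question.

```php
<?php
// Hypothetical helper: removes <script> elements and on* event attributes.
// NOT a complete XSS filter; see the caveats in the text above.
function stripScripts(string $html): string
{
    $doc = new DOMDocument();

    // Suppress parser warnings for imperfect user markup.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so it has a single known root.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove every <script> element.
    foreach (iterator_to_array($xpath->query('//script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Remove attributes whose name starts with "on" (onclick, onload, ...).
    $onAttributes = $xpath->query('//@*[starts-with(translate(name(), "ON", "on"), "on")]');
    foreach (iterator_to_array($onAttributes) as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    // Serialize only the children of the wrapper <div>.
    $root = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo stripScripts('<p onclick="steal()">Hi <b>there</b></p><script>alert(1)</script>'), PHP_EOL;
```

A blacklist like this is easy to get wrong, which is the reason the comment goes on to suggest a restricted format such as BBCode for simple formatting.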
