What is the correct way to detect whether string inputs contain HTML or not?

Question

When receiving user input on forms I want to detect whether fields like "username" or "address" does not contain markup that has a special meaning in XML (RSS feeds) or (X)HTML (when displayed).

So which of these is the correct way to detect whether the input entered doesn't contain any special characters in HTML and XML context?

if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)

or

if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data)

or

if (preg_match("/[^\p{L}\-.']/u", $text)) // problem: also caches symbols

Have I missed anything else,like byte sequences or other tricky ways to get markup tags around things like "javascript:"? As far as I'm aware, all XSS and CSFR attacks require < or > around the values to get the browser to execute the code (well at least from Internet Explorer 6 or later anyway) - is this correct?

I am not looking for something to reduce or filter input. I just want to locate dangerous character sequences when used in XML or HTML context. (strip_tags() is horribly unsafe. As the manual says, it doesn't check for malformed HTML.)

Update

I think I need to clarify that there are a lot people mistaking this question for a question about basic security via "escaping" or "filtering" dangerous characters. This is not that question, and most of the simple answers given wouldn't solve that problem anyway.

Update 2: Example

User submits input
if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)
I save it

Now that the data is in my application I do two things with it - 1) display in a format like HTML - or 2) display inside a format element for editing.

The first one is safe in XML and HTML context

<h2><?php print $input; ?></h2>' <xml><item><?php print $input; ?></item></xml>

The second form is more dangerous, but it should still be safe:

<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">

Update 3: Working Code

You can download the gist I created and run the code as a text or HTML response to see what I'm talking about. This simple check passes the http://ha.ckers.org XSS Cheat Sheet, and I can't find anything that makes it though. (I'm ignoring Internet Explorer 6 and below).

I started another bounty to award someone that can show a problem with this approach or a weakness in its implementation.

Update 4: Ask a DOM

It's the DOM that we want to protect - so why not just ask it? Timur's answer lead to this:

function not_markup($string)
{
    libxml_use_internal_errors(true);
    if ($xml = simplexml_load_string("<root>$string</root>"))
    {
        return $xml->children()->count() === 0;
    }
}

if (not_markup($_POST['title'])) ...

strip_tags() is horribly unsafe. As the manual says, it doesn't check for *malformed HTML*. — Xeoncross, Dec 07 '11 at 16:46
Thanks @webarto, but like I said, I'm not trying to fix HTML. Just stop the browser from treating the string as markup. — Xeoncross, Dec 07 '11 at 17:33
XSS attacks can also occure in html attributes, css and javascript variables. You need to escape depending on context. See the OWASP XSS cheat sheet. — Erlend, Dec 08 '11 at 17:07
Yes @Erlend, context is important. Inside form elements, or inside attributes though is a different context from inside normal markup tags where something like `" onClick='dothis'"` won't work. I'm just looking for protection for `print '
' . $name . '
';` not `print '';` where the value must be htmlencoded using specialchars. — Xeoncross, Dec 08 '11 at 17:48
Yes. By the way, csrf protection requires completely different mechanisms like token based approaches. — Erlend, Dec 09 '11 at 19:07
can you not just do a `preg_replace()` kind of thing with `<`, `>`, `/` and their html entities? Or maybe I'm misunderstanding what you're needing. — Willem Ellis, Dec 12 '11 at 19:06
@baudday, I'm not trying to *clean the code*. I'm just trying to spot values that might be problematic when used in HTML or XML context. I don't want to filter or remove anything. — Xeoncross, Dec 12 '11 at 19:41

score 11 · Accepted Answer · edited Jun 29 '12 at 08:02

I don't think you need to implement a huge algorithm to check if string has unsafe data - filters and regular expressions do the work. But, if you need a more complex check, maybe this will fit your needs:

<?php
$strings = array();
$strings[] = <<<EOD
    ';alert(String.fromCharCode(88,83,83))//\';alert(String.fromCharCode(88,83,83))//";alert(String.fromCharCode(88,83,83))//\";alert(String.fromCharCode(88,83,83))//--></SCRIPT>">'><SCRIPT>alert(String.fromCharCode(88,83,83))</SCRIPT>
EOD;
$strings[] = <<<EOD
    '';!--"<XSS>=&{()}
EOD;
$strings[] = <<<EOD
    <SCRIPT SRC=http://ha.ckers.org/xss.js></SCRIPT>
EOD;
$strings[] = <<<EOD
    This is a safe text
EOD;
$strings[] = <<<EOD
    <IMG SRC="javascript:alert('XSS');">
EOD;
$strings[] = <<<EOD
    <IMG SRC=javascript:alert('XSS')>
EOD;
$strings[] = <<<EOD
    <IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
EOD;
$strings[] = <<<EOD
    perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out
EOD;
$strings[] = <<<EOD
    <SCRIPT/XSS SRC="http://ha.ckers.org/xss.js"></SCRIPT>
EOD;
$strings[] = <<<EOD
    </TITLE><SCRIPT>alert("XSS");</SCRIPT>
EOD;



libxml_use_internal_errors(true);
$sourceXML = '<root><element>value</element></root>';
$sourceXMLDocument = simplexml_load_string($sourceXML);
$sourceCount = $sourceXMLDocument->children()->count();

foreach( $strings as $string ){
    $unsafe = false;
    $XML = '<root><element>'.$string.'</element></root>';
    $XMLDocument = simplexml_load_string($XML);
    if( $XMLDocument===false ){
        $unsafe = true;
    }else{

        $count = $XMLDocument->children()->count();
        if( $count!=$sourceCount ){
            $unsafe = true;
        }
    }

    echo ($unsafe?'Unsafe':'Safe').': <pre>'.htmlspecialchars($string,ENT_QUOTES,'utf-8').'</pre><br />'."\n";
}
?>

+1 Interesting, so you are asking the DOM engine if the value contains children or not. Makes sense, if it *is* HTML then it will be greater than one child. However, this also assumes the PHP DOM parser views the HTML the same as the browser's DOM parser. — Xeoncross, Dec 22 '11 at 16:52
Timur, I think you're my new hero. Let me do some more testing. Rather than trying to guess on *behalf of* the DOM engine - **lets just ask it!** Brilliant! — Xeoncross, Dec 22 '11 at 17:28
Yep. And, if parsing the html leads to error, you`ll know it. But don`t forget that text (such as "value" in my example) is also a node. So, for `value` `$xml->children()->count()` will be `1` — Timur, Dec 23 '11 at 06:53
And, of course, `$xml` in your example is a `` element. If you are parsing only `"$string"`, than the correct check would be `$xml->count()==1`, because `SimpleXMLElement::count()` returns the amount of children elements. — Timur, Dec 23 '11 at 07:09
@Xeoncross: it seems my new answer didn't catch your attention... What about `& is not an &` as the input string? Both your code at update 3 and the new one inspired on Timur's answer at update 4 can't handle it. Have you tried `& is not an &` with update 3 code and saw the result? Have you tried it with update 4 code? Please, try, and please add a comment to my new answer, answering the questions I've made in the paragraph which starts with 'What if the user wants his "username" to be...' I'm really curious, as it sincerely seems to be "a problem with this approach" you asked... — J. Bruni, Dec 24 '11 at 17:16

score 7 · Answer 2 · answered Dec 12 '11 at 21:24

7

In a comment above, you wrote:

Just stop the browser from treating the string as markup.

This is an entirely different problem to the one in the title. The approach in the title is usually wrong. Stripping out tags just mangles input and can lead to data loss. Ever tried to talk about HTML on a blog that strips tags? Frustrating.

The solution that is usually the correct one is to do as you said in your comment - to stop the browser from treating the string as markup. This - literally taken - is not possible. What you do instead is encode the content as HTML.

Consider the following data:

<strong>Test</strong>

Now, you can look at this one of two ways. You can look at it as literal data - a sequence of characters. You can look at it as HTML - markup that includes strongly emphasises text.

If you just dump that out into an HTML document, you are treating it as HTML. You can't treat it as literal data in that context. What you need is HTML that will output the literal data. You need to encode it as HTML.

Your problem is not that you have too much HTML - it's that you have too little. When you output <, you are outputting raw data in an HTML context. You need to convert it to <, which is the HTML representation of that data before outputting it.

PHP offers a few different options for doing this. The most direct is to use htmlspecialchars() to convert it into HTML, and then nl2br() to convert the line breaks into <br> elements.

answered Dec 12 '11 at 21:24

Jim

72,985
14
101
108

1

I would like to clarify that I have very effective HTML filtering which does correctly encode HTML inside larger bodies of text like comment forms or from parsed HTML documents. However, I wish to only check for the existence of characters that pose a threat used in small strings such as those used in form inputs. They should not be encoded, they should be rejected. I also want to use this identifying code when parsing documents to identify if they contain HTML or XML control characters. – Xeoncross Dec 12 '11 at 21:38
I would like to add that `htmlspecialchars()` is perfect for quoting values wrapped by HTML/XML tags - but fails when quoting values injected into attributes of HTML/XML tags. `` is NOT safe while `` is. – Xeoncross Dec 12 '11 at 21:41
1

@Xeoncross - Could you explain why it is not safe? If you write `
'); ?>">Test
`, it will output `
Test
` and that is correctly escaped and therefore not dangerous. – martinstoeckli Dec 12 '11 at 21:57
1

`htmlspecialchars` only handles `&"'<>`. When you have something like `javascript:alert();` you avoid that filter completely. – Xeoncross Dec 12 '11 at 22:00
@Xeoncross, this would only be effective when used in something like `onmouseover` or `href`. And you probably weren't going to feed something input by users directly -even if it's escaped- into these kinds of attributes anyway, I hope? Or more generally, I think security is greatly defined by context. – Halil Özgür Sep 23 '12 at 22:57
@HalilÖzgür, security is defined by context. However, never assume that since something *probably* won't be used improperly that it is safe. – Xeoncross Sep 24 '12 at 14:55
@Xeoncross, yes, but sometimes naive use of over-escaping et al. causes some issues (e.g. double-escaping, unintentional un-escaping etc) :) I think the developer should always be vigilant about the context -unfortunately, hence no amount of safe net can substitute for that :) – Halil Özgür Sep 24 '12 at 15:08

Peter O. · Answer 3 · 2019-04-25T08:23:55.500

If you're just "looking for protection for print '<h3>' . $name . '</h3>'", then yes, at least the second approach is adequate, since it checks whether the value would be interpreted as markup if it weren't escaped. (In this case, the area where $name would appear is element content, and only the characters &, <, and > have special meaning when they appear in element content.) (For href and similar attributes, the check for "JavaScript: " may be necessary, but as you stated in a comment, that isn't a goal.)

For official sources, I can refer to the XML specification:

Content production in section 3.1: Here, content consists of elements, CDATA sections, processing instructions, and comments (which must begin with <), references (which must begin with &), and character data (which contains any other legal character). (Although a leading > is treated as character data in element content, many people usually escape it along with <, and it's better safe than sorry to treat it as special.)
Attribute value production in section 2.3: A valid attribute value consists of either references (which must begin with &) or character data (which contains any other legal character, but not < or the quote symbol used to wrap the attribute value). If you need to place string inputs in attributes in addition to element content, the characters " and ' need to be checked in addition to &, <, and possibly > (and other characters illegal in XML).
Section 2.2: Defines what Unicode code points are legal in XML. In particular, null is illegal in an XML document and may not display properly in HTML.

HTML5 (the latest working draft, which is a work in progress, describes a very elaborate parsing algorithm for HTML documents:

Element content corresponds to the "data state" in the parsing algorithm. Here, the string input should not contain a null character, < (which begins a new tag), or & (which begins a character reference).
Attribute values correspond to the "before attribute value state" in the parsing algorithm. For simplicity, we assume the attribute value is wrapped in double quotation marks. In that case, the parser moves to the "attribute value (double-quoted) state". In this case, the string input should not contain a null character, " (which ends the attribute value), or & (which begins a character reference).

If string inputs are to be placed in attribute values (unless placing them there is solely for display purposes), there are additional considerations to keep in mind. For example, HTML 4 specifies:

User agents should interpret attribute values as follows:

Replace character entities with characters,

Ignore line feeds,

Replace each carriage return or tab with a single space.

User agents may ignore leading and trailing white space in CDATA attribute values[.]

Attribute value normalization is also specified in the XML specification, but apparently not in HTML5.

EDIT (Apr. 25, 2019): Also, be suspicious of inputs containing—

the null code point (as it can cause parse errors in certain places, as specified in the HTML5 specification), or
any code point illegal in XML (as it will cause parse errors upon reading the XML document),

...assuming htmlspecialchars doesn't escape those code points already.

Awesome, someone that understands my question. However, this is just a reiteration of what I already said, so I'm looking for a more *official* source that your and my existing knowledge. — Xeoncross, Dec 16 '11 at 19:40
Thanks Peter, now perhaps someone can also add sources for independent user agents (browsers and phones) as well as HTML 4 & 5. — Xeoncross, Dec 18 '11 at 19:45

score 3 · Answer 4 · edited Jan 11 '14 at 23:51

3

HTML Purifier does a good job and is very easy to implement. You could also use a Zend Framework filter like Zend_Filter_StripTags.

HTML Purifier doesn't just fix HTML.

edited Jan 11 '14 at 23:51

Peter Mortensen

30,738
21
105
131

answered Dec 12 '11 at 08:42

mat

1,619
1
15
25

That might solve the problem of spotting characters that pose a threat in certain contexts, but it is also is a *big waste of memory* just to spot the 2-12 possible character sequences I am looking for. HTMLPurifier is built to filter, not locate dangerous input. – Xeoncross Dec 12 '11 at 19:57

score 2 · Answer 5 · edited Jan 11 '14 at 23:54

2

I think you answered your own question. The function htmlspecialchars() does exactly what you need, but you should not use it until you write the user input to a page. To store it in a database there are other functions, like mysqli_real_escape_string().

As a rule of thumb, one can say that you should escape user input only when needed, for the given target system:

Escaping user input often means a loss of the original data, and different target systems (HTML output / SQL / execution) need different escaping. They can even conflict with each other.
You have to escape the data for the given purpose anyway, always. You should not trust even the entries from your database. So escaping when reading from user input does not have any big advantage, but double escaping can lead to invalid data.

In contrast to escaping, validating the content is a good thing to do early. If you expect an integer, only accept integers, otherwise refuse the user input.

edited Jan 11 '14 at 23:54

Peter Mortensen

30,738
21
105
131

answered Dec 12 '11 at 20:26

martinstoeckli

23,430
6
56
87

mysql-_real_escape-string will not remove the html tags, which was what the question was about – Kneel-Before-ZOD Dec 16 '11 at 16:57
@user705339 - It does not and it should not, that's what i tried to say with this answer. Escape only when needed, for the given purpose. It's ok when HTML tags are stored in the database, this is not dangerous, but when displaying it on an HTML page, then you should use ´htmlspecialchars()´ to make this tags harmless. You should do this even when you think that no tags are stored in the database. As for javascript, normally javascript needs HTML tags to run, one would have to write this user input into an event-attribute so they can be executed. – martinstoeckli Dec 16 '11 at 17:26

score 1 · Answer 6 · edited Jan 11 '14 at 23:57

I am certainly not a security expert, but from what I gather something like your suggested

if (htmlspecialchars($data, ENT_NOQUOTES, 'UTF-8') === $data)

should work to prevent you from passing on contaminated strings, given you got your encoding right there.

XSS attacks that don't require '<' or '>' rely on the string being handled in a JavaScript block right there and then, which, from how I read your question, is not what you are concerned with in this situation.

score 1 · Answer 7 · edited Jan 12 '14 at 00:00

1

I suggest you to take a look at the xss_clean function from CodeIgniter. I know you don't want to clean, sanitize, or filter anything. You just want to "detect bad behaviour" and reject it. That's exactly why I recommend you to look at this function code.

IMO, we can find a deep and strong XSS vulnerability knowledge there, including all the knowledge you want and need with your question.

Then, my short / direct answer to you would be:

if (xss_clean($data) === $data)

Now, you don't need to use the whole CodeIgniter framework just because you need this single function, of course. But I believe you may want to grab the whole CI_Security class (at /system/core/Security.php) and do a few modifications to eliminate other dependencies.

As you will see, xss_clean code is quite complex, as XSS vulnerabilities really are, and I would just trust it and do not try to "reinvent this wheel"... IMHO, you can't get rid of XSS vulnerabilities by merely detecting a dozen of characters.

edited Jan 12 '14 at 00:00

Peter Mortensen

30,738
21
105
131

answered Dec 18 '11 at 00:05

J. Bruni

20,322
12
75
92

Thanks @JBruni, a couple years ago I reported some security vulnerabilities to the CI team about this function. I supposed I should take a look again at that function to see how it's evolved. – Xeoncross Dec 18 '11 at 19:30
Well, again the problem they are tackling is taking text and making it [safe for immediate display](https://github.com/EllisLab/CodeIgniter/blob/develop/system/core/Security.php#L264) with no further processing needed. In the example I stated above, this comes as full-filling both A and B. The problem is that it also removes many things that should be allowed in text given the right context. If you tried to discuss any of the things in XSS it would be removed causing problems. I think filtering for context should be done right before the context - not in a general sense as they do it. – Xeoncross Dec 18 '11 at 19:41
@Xeoncross I confess that when I wrote my answer I had still not grasped exactly what you want(ed)... which is not easy... I mean: the idea of what you want, itself, is easy, but... you know, it is not easy to make this simple idea understandable to others (if it was, all we would not need to wrtie so much trying to reach an agreement)... like trying to explain what a lamp is and how it works, before it existed and worked... [to be continued] – J. Bruni Dec 19 '11 at 00:14
The original question (What is the correct way to detect whether string inputs contain HTML or not?) is tricky... Actually, we can say that `Hello, World` is pure HTML. What about `2 < 5` ... contains HTML or not? If you mean **HTML tags** then, well, it seems your first code option is the correct answer. – J. Bruni Dec 19 '11 at 00:20
Indeed, CI `xss_clean` suffered lots of serious flaws, but they seem to have been corrected in the latest release (2.1.0). And, indeed, the whole discussion, including the question, started to make sense since the **context** was gradually explained... i.e., what you want to do, as **output**, with your input... – J. Bruni Dec 19 '11 at 00:30

J. Bruni · Answer 8 · 2011-12-19T02:20:04.537

The correct way to detect whether string inputs contain HTML tags, or any other markup that has a special meaning in XML or (X)HTML when displayed (other than being an entity) is simply

if (mb_strpos($data, '<') === FALSE AND mb_strpos($data, '>') === FALSE)

You are correct! All XSS and CSFR attacks require < or > around the values to get the browser to execute the code (at least from IE6+).

Considering the output context given, this is sufficient to safely display in a format like HTML:

<h2><?php print $input; ?></h2> <xml><item><?php print $input; ?></item></xml>

Of course, if we have any entity in the input, like á, a browser will not output it as á, but as á, unless we use a function like htmlspecialchars when doing the output. In this case, even the < and > would be also safe.

In the case of using the string input as the value of an attribute, the safety depends on the attribute.

If the attribute is an input value, we must quote it and use a function like htmlspecialchars in order to have the same content back for editing.

<input value="<?php print htmlspecialchars($input, ENT_QUOTES, 'UTF-8');?>">

Again, even the < and > characters would be safe here.

We may conclude that we do not have to do any kind of detection and rejection of the input, if we will always use htmlspecialchars to output it, and our context will fit always the above cases (or equally safe ones).

[And we also have a number of ways to safely store it in the database, preventing SQL exploits.]

What if the user wants his "username" to be & is not an &? It does not contain < nor >... will we detect and reject it? Will we accept it? How will we display it? (This input gives interesting results in the new bounty!)

Finally, if our context expands, and we will use the string input as an anchor href, then our whole approach suddenly changes dramatically. But this scenario is not included in the question.

(It worths mentioning that even using htmlspecialchars the output of a string input may differ if the character encodings are different on each step.)

Hey! The new bounty has changed! I think I win it: `& is not an &` :-) (based on René Magritte's "this is not a pipe") — J. Bruni, Dec 19 '11 at 01:53
You're right, I missed this answer. Anyway, this is a problem. However, it's not a *security problem* - more of a display problem. It's just like when you see strings that were perhaps double-escaped `Hi I&quote;m John`. `&` is for character display and will not interfere with parsing the same as `' " < >`. Sill, as you mention this same thing applies to quotes when used for things like ``. — Xeoncross, Dec 25 '11 at 15:20

score 0 · Answer 9 · answered Dec 07 '11 at 17:13

0

filter_input + FILTER_SANITIZE_STRING (there are lots of flag you can chose from)

:- http://www.php.net/manual/en/filter.filters.sanitize.php

answered Dec 07 '11 at 17:13

ajreal

46,720
11
89
119

Interesting, though the `STRIP_LOW` and `STRIP_HIGH` makes me think this is only good for ASCII. I need to look into this more... – Xeoncross Dec 07 '11 at 17:31

score 0 · Answer 10 · edited Jan 11 '14 at 23:52

0

If the reason of the question is for XSS prevention, there are several ways to explode a XSS vulnerability. A great cheatsheet about this is the XSS Cheatsheet at ha.ckers.org.

But, detection is useless in this case. You only need prevention, and the correct use of htmlspecialchars/htmlentities on your text inputs before saving them to your database is faster and better than detecting bad input.

edited Jan 11 '14 at 23:52

Peter Mortensen

30,738
21
105
131

answered Dec 12 '11 at 20:24

Andrés Torres

747
5
16

3

HTML encoding is an output problem, not an input problem. Doing it before you save the data into the database is the wrong place to do it. This is for output encoding, it should be done after you pull the data out of the database. – Jim Dec 12 '11 at 21:24
@Jim dont you think that that violates the DRY principle? Why clean your output every time you render a page when you can clean your input once? – Greg Guida Dec 23 '11 at 20:16
1

@Greg Because you don't always know how the data will be output. Even if you output only to a webpage, you may need to insert data into a query string, JavaScript, or HTML, each of which has different control characters and escaping protocols. – Max Nanasy Jul 17 '12 at 20:15

score 0 · Answer 11 · answered Dec 15 '11 at 22:30

0

You could use a regular expression if you know the character sets that are allowed. IF a character is in the username that isn't allowed then throw an error:

[a-zA-Z0-9_.-]

Test your regular expressions here: http://www.perlfect.com/articles/regextutor.shtml

<?php
$username = "abcdef";
$pattern = '/[a-zA-Z0-9_.-]/';
preg_match($pattern, $username, $matches);
print_r($matches);
?>

answered Dec 15 '11 at 22:30

Timothy Martens

678
4
17

This doesn't support unicode characters. I think you need to go study regex some more so you can support non-English words. `if(preg_match_all('/\P{L}/u', $text))` would work just fine if I wanted word characters. – Xeoncross Dec 15 '11 at 22:53
The reverse the regex and look for HTML <.*> – Timothy Martens Dec 16 '11 at 18:04
That's a bad idea. They don't need to inject a full tag - just part of a tag in most cases. – Xeoncross Dec 16 '11 at 18:31
just search for the opening < and seperately the > then. – Timothy Martens Dec 16 '11 at 18:37

score -1 · Answer 12 · edited Jan 11 '14 at 23:55

You can make use of the strip_tags function in PHP. This function will strip HTML and PHP tags from given data.

For example, $data is the variable which holds your content then you can use this like this:

if (strlen($data) != strlen(strip_tags($data))){
    return false;
} 
else{
    return true;
}

It will check stripped content against the original content. If both are equal then we can hope there aren't any HTML tags, and it returns true. Otherwise, it returns false as it found some HTML tags.

score -1 · Answer 13 · answered Dec 16 '11 at 17:04

Regex is still the most efficient way of solving your problem. It doesn't matter what frameworks you plan to use, or are advised to use, the most efficient way would still be a custom regex code. You can test the string with a regex, and remove (or convert) the affected section using htmlcharacter function.
No need to install any other framework, or use some long-winded application.

What is the correct way to detect whether string inputs contain HTML or not?

Update

Update 2: Example

Update 3: Working Code

Update 4: Ask a DOM

' . $name . '

13 Answers13

Linked