17

I am looking into using HTML Purifier to ensure that a user-supplied string (representing a person's name) is sanitized.

I do not want to allow any HTML tags, scripts, markup, etc. I just want alphabetic, numeric, and normal punctuation characters.

The sheer number of options available for HTML Purifier is daunting and, as far as I can see, the docs do not seem to have a beginning, middle, or end.

see: http://htmlpurifier.org/docs

Is there a simple hello-world tutorial online for HTML Purifier that shows how to sanitize a string by removing all the bad stuff from it?

I am also considering just using strip_tags() or PHP's built-in data sanitizing.

JW.
  • I'd say go for the simple `strip_tags()` for a trivial task like this :) Pros: Easy to implement, easy to understand, easy to replace (whenever the requirements change). Cons: ? – jensgram Apr 20 '10 at 18:20
  • I second what jensgram says. This is a task for `strip_tags()` and `htmlentities()` - should be enough to thwart any attack. – Pekka Apr 20 '10 at 18:23
  • yes - I'd love to use `strip_tags()` but I read that "striptags() is fundamentally flawed and should not be used." - http://htmlpurifier.org/comparison#striptags - yet I am not sure how up to date that is or how relevant it is to a 'blanket usage' of removing all tags – JW. Apr 20 '10 at 18:44
  • @JW "Removes foreign tags: Buggy" worries me a little. But "well-formed", "nesting", and "attributes" are safe to ignore in your case. – jensgram Apr 21 '10 at 05:38
  • HTML Purifier is a wonderful tool **for HTML**. Using it on a non-HTML text-string is not great. It'll do some things for you, but it's not really what you want. – TRiG Jul 05 '11 at 13:09
  • No one answered the real question: Is there a simple hello world tutorial online for HTML Purifier that shows how to sanitize a string removing all the bad stuff out of it. :( – Denis Nikolaenko Jul 28 '11 at 19:46
  • Ha ha! Yes. This question has been open for a while. I guess the answer is no. Add it as an answer and you might win points. – JW. Jul 29 '11 at 09:18

10 Answers

10

I've been using HTMLPurifier for sanitizing the output of a rich text editor, and ended up with:

include_once('htmlpurifier/library/HTMLPurifier.auto.php');

$config = HTMLPurifier_Config::createDefault();
// HTML Purifier 4.x takes a single dotted "Namespace.Directive" key;
// the older three-argument set('Core', 'Encoding', ...) form is deprecated.
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');

if (defined('PURIFIER_CACHE')) {
    // PURIFIER_CACHE must name a writable directory
    $config->set('Cache.SerializerPath', PURIFIER_CACHE);
} else {
    // Disable the cache entirely
    $config->set('Cache.DefinitionImpl', null);
}

// Help out the Purifier a bit, until it develops this functionality:
// repeatedly drop empty <em>/<strong> pairs, keeping any whitespace inside them
while (($cleaner = preg_replace('!<(em|strong)>(\s*)</\1>!', '$2', $input)) != $input) {
    $input = $cleaner;
}

$filter = new HTMLPurifier($config);
$output = $filter->purify($input);

The main points of interest:

  1. Include the autoloader.
  2. Create an instance of HTMLPurifier_Config as $config.
  3. Set configuration settings as needed, with $config->set().
  4. Create an instance of HTMLPurifier, passing $config to it.
  5. Use $filter->purify() on your input.

However, it's entirely overkill for something that doesn't need to allow any HTML in the output.
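For the plain-text case in the question, the five steps above could be reduced to something like the sketch below. It is not a definitive recipe: the include path is copied from this answer and is an assumption about where the library lives, and `purify_to_text` is a hypothetical helper name. It relies on HTML Purifier's `HTML.Allowed` directive being set to an empty string so that no elements are permitted.

```php
<?php
// Sketch only. Assumption: HTML Purifier is unpacked under ./htmlpurifier,
// matching the include path used in the answer above.
// purify_to_text() is a hypothetical helper, not part of the library.
function purify_to_text(string $input): string
{
    static $purifier = null;

    if ($purifier === null) {
        // Step 1: include the autoloader (loaded lazily on first use).
        require_once 'htmlpurifier/library/HTMLPurifier.auto.php';

        // Steps 2-3: create a config object and set directives on it.
        $config = HTMLPurifier_Config::createDefault();
        $config->set('Core.Encoding', 'UTF-8');
        $config->set('HTML.Allowed', '');            // allow no elements at all
        $config->set('Cache.DefinitionImpl', null);  // no writable cache directory needed

        // Step 4: create the purifier with that config.
        $purifier = new HTMLPurifier($config);
    }

    // Step 5: purify. The result is still HTML, so a stray "<" or "&"
    // in the input comes back as an entity rather than a raw character.
    return $purifier->purify($input);
}

// Usage (commented out so the file loads even without the library):
// echo purify_to_text($_POST['name'] ?? '');
```

Because the return value is HTML-escaped text, it can be echoed into a page directly, but it should be decoded or stored in raw form if it is going anywhere other than HTML output.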

eswald
0

You should validate input based on its expected content. For a name, for example, use a regular expression:

'/^([A-Z][a-z]+[ ]?)+$/' // ASCII only, but not hard to extend

The ^ and $ anchors make the pattern match the whole string rather than any capitalized word inside it. This validation should do the job well. Then escape the output when printing it on the page, preferably with htmlspecialchars.
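A minimal sketch of that validate-then-escape approach, assuming the name pattern is anchored so it must match the whole string. The helper names `is_valid_name` and `escape_html` are made up for illustration, and the `D` modifier is an addition here so that `$` cannot match before a trailing newline.

```php
<?php
// Hypothetical helper: true when $name consists only of capitalized ASCII words.
function is_valid_name(string $name): bool
{
    // ^...$ with the D modifier requires the whole string to match.
    return preg_match('/^([A-Z][a-z]+[ ]?)+$/D', $name) === 1;
}

// Hypothetical helper: escape a value for use in HTML text or attribute values.
function escape_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

var_dump(is_valid_name('John Smith'));                 // bool(true)
var_dump(is_valid_name('<script>alert(1)</script>'));  // bool(false)
echo escape_html('Tom & "Jerry"'), "\n";               // Tom &amp; &quot;Jerry&quot;
```

Validation rejects input at the door, while escaping is applied every time the value is printed; the two complement each other rather than replace each other.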

Mikulas Dite
0

You can use something like htmlspecialchars() to preserve the characters the user typed without the browser interpreting them as markup.

NeuroScr
0

I've always thought CodeIgniter's XSS cleaning class was quite good, but more recently I've turned to Kohana.

Have a look at its xss_clean method:

http://github.com/kohana/core/blob/c443c44922ef13421f4a3af5b414e19091bbdce9/classes/kohana/security.php

Andrei Serdeliuc ॐ
0

HTML Purifier in action. Try entering <?php echo "HELLO";?> as the first name and WORLD as the last name, then check the output.

<?php
include( 'htmlpurifier/htmlpurifier/library/HTMLPurifier.auto.php');
?>
<form method="post">
<input type="text" name="fname" placeholder="first name"><br>
<input type="text" name="lname" placeholder="last name"><br>
<input type="submit" name="submit" value="submit">
</form>
        
<?php
if (isset($_POST['submit']))
{
    // One purifier instance with the default configuration can clean both fields
    $config = HTMLPurifier_Config::createDefault();
    $purifier = new HTMLPurifier($config);

    // Fall back to an empty string if a field is missing from the request
    $fname = $purifier->purify($_POST['fname'] ?? '');
    $lname = $purifier->purify($_POST['lname'] ?? '');

    echo "First name is: ".$fname."<br>";
    echo "Last name is: ".$lname;
}
Don'tDownvoteMe
-1

I think the easiest way to remove all non-alphanumeric characters from a string is to use Regex.Replace() (.NET), as follows:

string cleaned = Regex.Replace(stringToCleanUp, @"[\W]", "");

\w (lowercase) matches any ‘word’ character, which for ASCII input is equivalent to [a-zA-Z0-9_] (by default .NET also counts Unicode letters and digits as word characters). \W (uppercase) matches any ‘non-word’ character, i.e. anything NOT matched by \w. The code above finds every \W match and replaces it with nothing. The verbatim string (@"...") keeps the C# compiler from rejecting \W as an unrecognized escape sequence.

Alternatively, if you don’t want to allow the underscore, you can use [^a-zA-Z0-9], like this:

string cleaned = Regex.Replace(stringToCleanUp, "[^a-zA-Z0-9]", "");
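Since the question is about PHP, the same character-class idea can be written with preg_replace(). This is only a sketch, and `keep_alphanumeric` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: keep only ASCII letters and digits.
function keep_alphanumeric(string $input): string
{
    // preg_replace() returns null on a regex error; fall back to an empty string.
    return preg_replace('/[^a-zA-Z0-9]/', '', $input) ?? '';
}

echo keep_alphanumeric("J. O'Neil-Smith <b>3rd</b>"), "\n"; // JONeilSmithb3rdb
```

Note what the output shows: spaces, apostrophes and hyphens are removed along with the angle brackets, and the tag names themselves survive as letters. For a person's name, the character class would need widening to keep the punctuation the question asks for.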

omadmedia
  • Thanks for these, Mikulas Dite and omadmedia. I will probably add some regexes into the mix. However, I would still like to know if there is a hello world tutorial for HTML Purifier. I guess someone would have pointed to one by now if there was. – JW. Apr 27 '10 at 15:46
-1

If you are trying to prevent code injection attacks, just escape the data: store it as is and print it as the user entered it.

For example: to avoid SQL injection problems in MySQL, use the mysql_real_escape_string() function or similar to escape values placed in the SQL statement. (The mysql_* functions were removed in PHP 7; mysqli or PDO prepared statements are the current equivalents.)

Another example: when writing data to an HTML document, pass it through htmlentities() so that it appears as the user entered it.
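The two examples above might look roughly like this with a PDO prepared statement in place of manual SQL escaping. It is a sketch under assumptions: it uses an in-memory SQLite database (so it needs the pdo_sqlite extension) instead of MySQL, and the `people` table is made up for illustration.

```php
<?php
// Assumption: the pdo_sqlite extension is available; a MySQL DSN would work the same way.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE people (name TEXT)'); // hypothetical table

$name = "Robert'); DROP TABLE people;-- <b>";

// The placeholder keeps the value out of the SQL text, so no manual escaping is needed.
$stmt = $pdo->prepare('INSERT INTO people (name) VALUES (:name)');
$stmt->execute([':name' => $name]);

// The database holds exactly what the user typed.
$stored = $pdo->query('SELECT name FROM people')->fetchColumn();

// Escape only at output time, for the HTML context.
$safeHtml = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $safeHtml, "\n";
```

Each escaping step belongs to the context the data is entering: the placeholder handles the SQL context, and htmlspecialchars() handles the HTML context.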

fjfnaranjo
  • thanks, but no, it's not quite what I wanted. The main thing I am looking for is to strip all markup and scripts from user input, leaving alpha, numeric and grammatical characters. allow '<' , disallow ''. allow '>' disallow '' etc – JW. Apr 29 '10 at 23:57
-1

For simplicity, you can either use strip_tags() or replace occurrences of <, >, and & with &lt;, &gt;, and &amp;, respectively. This definitely isn't the best solution, but it is the quickest.
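A small sketch of the two options side by side, using only PHP built-ins. The sample input is made up; `ENT_NOQUOTES` is chosen here so that only <, >, and & are replaced, as described above.

```php
<?php
$input = '<b>Tom</b> & Jerry';

// Option 1: drop the tags and keep the text between them.
$stripped = strip_tags($input);

// Option 2: keep every character but neutralize the ones HTML treats specially.
$escaped = htmlspecialchars($input, ENT_NOQUOTES, 'UTF-8');

echo $stripped, "\n"; // Tom & Jerry
echo $escaped, "\n";  // &lt;b&gt;Tom&lt;/b&gt; &amp; Jerry
```

The first option changes what the user typed; the second preserves it and only changes how it is represented in the page.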

Propeng
-2

I usually clean all user input before sending it to my database with the following:

mysql_real_escape_string( htmlentities( strip_tags($str) ));
David Morrow
  • Why? `mysql_real_escape_string()` makes sense, and if you have GPC magic quotes enabled you may need to do `stripslashes()`, but why the `htmlentities()`? – TRiG Jan 19 '12 at 15:51
  • so that when you show the value from the db in a browser its valid html – David Morrow Jan 19 '12 at 16:57
  • I'd do that when I'm outputting it, not when saving it. That way the database stores real data. Makes sense to me, anyway. *shrug* – TRiG Jan 19 '12 at 17:38
  • personal preference, not really worth a -1 IMO – David Morrow Jan 19 '12 at 20:22
-2

Found this a week ago... LOVE it.

"A simple PHP HTML DOM parser written in PHP5+, supports invalid HTML, and provides a very easy way to handle HTML elements." http://simplehtmldom.sourceforge.net/

// Example
include 'simple_html_dom.php'; // the library's single include file
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Outputs: div
echo $e->outertext; // Outputs: <div>foo <b>bar</b></div>
echo $e->innertext; // Outputs: foo <b>bar</b>
echo $e->plaintext; // Outputs: foo bar

You can also loop through and remove individual tags, etc. The docs and examples are pretty good... I found it easy to use in quite a few places. :-)

  • This seems like it might fit the bill. I hadn't considered using a DOM parser for this - but it makes sense. Thanks for the tip. – JW. May 27 '10 at 14:56
  • Pedantic Note: this has nothing to do with security or sanitization. SimpleHTMLDom is just for working the elements in an object-oriented manner. -1 – ircmaxell Mar 30 '11 at 09:46