7

Which PHP function validates whether a string is HTML? My goal is to take input from the user and check whether it is HTML and not just a plain string.

Example of a non-HTML string:

sdkjshdk<div>jd</h3>ivdfadfsdf or sdkjshdkivdfadfsdf

Example of an HTML string:

<div>sdfsdfsdf<label>dghdhdgh</label> fdsgfgdfgfd</div>

Thanks

Ben
  • Both of those strings are snippets of HTML. The former happens to be obviously invalid, but neither would pass the W3C validator without modification. I think you need to be a bit more specific about what you want to allow, and what you want to prevent. – Annika Backstrom Jul 02 '10 at 15:36
  • My goal is to take input from the user and check whether it is HTML and not just a plain string. – Ben Jul 02 '10 at 15:39

7 Answers

11

Maybe you need to check if the string is well formed.

I would use a function like this:

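function check($string) {
  // Locate the first '<'; without one the input is plain text.
  $start = strpos($string, '<');
  if ($start === false) {
    return false;
  }
  $end = strrpos($string, '>', $start);

  if ($end !== false) {
    // Keep only the span from the first '<' to the last '>'.
    $string = substr($string, $start, $end - $start + 1);
  } else {
    // No closing '>': keep everything from the first '<' onward.
    $string = substr($string, $start);
  }
  // Collect libxml errors instead of emitting PHP warnings.
  libxml_use_internal_errors(true);
  libxml_clear_errors();
  $xml = simplexml_load_string($string);
  return count(libxml_get_errors()) == 0;
}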

Just a warning: HTML permits unbalanced markup like the following. It is not a well-formed XML chunk, but it is a legal HTML chunk:

<ul><li>Hi<li> I'm another li</li></ul>

Disclaimer: I've modified the code (without testing it) in order to detect well-formed HTML inside the string.

A last thought: maybe you should use strip_tags to control user input (as I've seen in your comments).

Dimitrios Desyllas
Eineki
  • this approach fails with `&nbsp;`, with no obvious workaround :-( – ErichBSchulz Jul 15 '13 at 09:29
  • @ErichBSchulz Maybe you have just to html_entity_decode($string) before testing it (quick and dirty solution but it should be sufficient) – Eineki Jul 16 '13 at 07:21
  • html_entity_decode() won't do it, for example because it would change `&lt;` to a literal less-than, which would at least have the wrong meaning, and most likely be non-well-formed. – TextGeek Jun 01 '16 at 20:37
  • `<p>FooBar` is valid HTML (even without closing `p`!), but this method will report errors. – Stephan Vierkant Feb 09 '22 at 08:17
5

You can use DOMDocument's loadHTML method.
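
As a rough sketch of what that could look like, the helper below (the name `html_parse_errors` and the "empty input" message are made up for illustration, not part of this answer) loads the string with DOMDocument::loadHTML and returns whatever libxml reported. Keep in mind that libxml's HTML parser is lenient, so an empty result means "parsed without complaints" rather than "valid HTML", and depending on the libxml version it may complain about newer elements.

```php
<?php
// Hypothetical helper (name chosen for illustration): parse $html with
// DOMDocument::loadHTML and return the messages libxml recorded.
function html_parse_errors(string $html): array {
    if (trim($html) === '') {
        // Assumption: treat blank input as an error ourselves,
        // because loadHTML rejects an empty string.
        return ['empty input'];
    }

    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    // Leave libxml in the state the caller had it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```

Calling `html_parse_errors($input)` and checking whether the returned array is empty gives a yes/no answer, while the messages themselves can be shown to the user.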

a1ex07
3

simplexml_load_string will fail if the string doesn't have a single root node. For example, this HTML is reported as invalid:

<p>A</p><p>B</p>

Here's my function:

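function check($string){
    // Plain text without any tag is not treated as HTML.
    $start = strpos($string, '<');
    if ($start === false) {
        return false;
    }
    $end = strrpos($string, '>', $start);

    // Keep the span from the first '<' to the last '>' (or to the end if there is no '>').
    if ($end !== false) {
        $string = substr($string, $start, $end - $start + 1);
    } else {
        $string = substr($string, $start);
    }

    // xml requires one root node
    $string = "<div>$string</div>";

    // Collect libxml errors instead of emitting PHP warnings.
    libxml_use_internal_errors(true);
    libxml_clear_errors();
    simplexml_load_string($string);

    return count(libxml_get_errors()) == 0;
}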
Diogo Gomes
2

Do you mean HTML or XHTML?

The HTML standard and interpretation are so loose that your first snippet might work. It won't be pretty but you might get something.

XHTML is quite a bit more strict and at minimum will expect your snippet to be well-formed (all opened tags are closed; tags can nest but not overlap) and may throw warnings if you have unrecognized elements or attributes.

Something like Tidy - http://php.net/manual/en/book.tidy.php - is probably a good start. Once you load your snippet using that, you can use tidy_error_count or tidy_get_error_buffer to see if it's "okay enough" for your needs.
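
A minimal sketch of that idea, assuming the tidy extension is installed (the helper name `tidy_report` and the shape of the returned array are made up for illustration). Note that Tidy files most markup problems under warnings rather than errors, so you will probably want to look at both counts.

```php
<?php
// Hypothetical helper (not from the Tidy docs): summarize Tidy's diagnostics.
// Returns null when the tidy extension is not available.
function tidy_report(string $html): ?array {
    if (!extension_loaded('tidy')) {
        return null;
    }

    // Parse with Tidy's default configuration, treating input as UTF-8.
    $tidy = tidy_parse_string($html, [], 'utf8');
    if ($tidy === false) {
        return null;
    }

    return [
        'errors'   => tidy_error_count($tidy),
        'warnings' => tidy_warning_count($tidy),
        'log'      => (string) tidy_get_error_buffer($tidy),
    ];
}
```

What counts as "okay enough" is then your call, for example zero errors and fewer than some number of warnings.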

CaseySoftware
  • My goal is to take input from the user and check whether it is HTML and not just a plain string. – Ben Jul 02 '10 at 15:39
  • Ok. And both are HTML... the HTML spec is *so* loose that it almost doesn't matter. What the second is additionally is XHTML. If that's what you're looking for, explore Tidy and see what you can do. – CaseySoftware Jul 03 '10 at 12:38
  • If you're in a situation where third-party applications are acceptable, tidy may be acceptable. However, its dependence on multiple third-party products rules it out for use in a library. – Jay Bienvenu Aug 30 '18 at 22:12
2

Are you trying to prevent users from posting HTML tags instead of plain strings? Because if that is what you want to do, you just need strip_tags(),

which will remove any HTML tags from the string.
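
If you only need to know whether the input contains any tags at all, one possible sketch is to compare the string with its stripped version. The helper name `contains_tags` is made up for illustration. This says nothing about whether the markup is well formed: the question's first example (`sdkjshdk<div>jd</h3>ivdfadfsdf`) would also count as containing tags.

```php
<?php
// Hypothetical helper: true when strip_tags() removed something,
// meaning the input contained at least one tag-like sequence.
function contains_tags(string $input): bool {
    return strip_tags($input) !== $input;
}
```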

Iznogood
2

You should use:

$html = "<html><body><p>This is array.</p><br></body></html>";

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
libxml_clear_errors();
$dom = new DOMDocument();
$dom->loadHTML($html);
if (empty(libxml_get_errors())) {
  echo "This is good HTML";
} else {
  echo "This is not HTML";
}
  • Valentin Caamal, this code fails using `$html = $html.' ';`; logically this is NOT correct (X)HTML. Yet using `$dom->loadXML($html);` with `$html = ' k ';` this function fails (note `loadXML`). – Stackoverflow Apr 10 '21 at 01:39
0

If you also want to make your site secure, you certainly have to use an HTML purifier such as HTML Purifier, Tidy, etc.

Ali Asgari