My HTML code looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

or like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

I want to extract the document type, which would be "XHTML 1.0 Strict" for the first one and "HTML 4.0" for the second. What regular expression would do this? I would like to use it with PHP's preg_match() function.

Please help me in this case.

Razin223

7 Answers

If the doctypes are always in the form shown, you could use

'#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i'

So

preg_match('#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i', $html, $match);
echo $match[0];
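
As a quick check, the pattern can be run against both doctypes from the question. This is only a sketch: the `$samples` and `$results` names are made up for illustration, and it assumes the doctype always starts with exactly `<!DOCTYPE HTML PUBLIC "-//W3C//DTD ` (in any letter case), because a lookbehind has to match a fixed-length string.

```php
<?php
// Sample inputs copied from the question (variable names are illustrative).
$samples = [
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">',
];

// The lookbehind anchors the match right after "DTD "; [^/]+ stops at "//EN".
$pattern = '#(?<=<!DOCTYPE HTML PUBLIC "-//W3C//DTD )[^/]+#i';

$results = [];
foreach ($samples as $html) {
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    $results[] = preg_match($pattern, $html, $match) === 1 ? $match[0] : null;
}

print_r($results);
```

Checking the return value of preg_match() before reading `$match[0]` avoids an undefined-index warning when a page has a different doctype, for example `<!DOCTYPE html>`.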
MikeM

How about using DOMDocument and DOMDocumentType?

$xml = new DOMDocument(); 
$xml->loadHTMLFile($url);

$name = $xml->doctype->publicId; // -//W3C//DTD XHTML 1.0 Strict//EN

$xml->doctype is a DOMDocumentType object with the following values:

DOMDocumentType Object
(
    [name] => html
    [entities] => (object value omitted)
    [notations] => (object value omitted)
    [publicId] => -//W3C//DTD XHTML 1.0 Strict//EN
    [systemId] => http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    [internalSubset] => 
    [nodeName] => html
    [nodeValue] => 
    [nodeType] => 10
    [parentNode] => (object value omitted)
    [childNodes] => 
    [firstChild] => 
    [lastChild] => 
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => 
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => 
    [baseURI] => 
    [textContent] => 
)

So you can now easily extract type:

$name = $xml->doctype->publicId;
$name = preg_replace('~.*//DTD\s+(.*?)//.*~', '$1', $name); // \s+ keeps the leading space out of the result
echo $name;

This results in XHTML 1.0 Strict.
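
If the HTML is already in a string, the same idea might look like the sketch below. The function name `doctypeNameFromHtml` is hypothetical. The sketch assumes the DOM extension is available and passes LIBXML_HTML_NODEFDTD so that libxml does not add a default doctype to documents that declare none.

```php
<?php
// Hypothetical helper: returns the DTD name from the public ID, or null.
function doctypeNameFromHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $doctype = $doc->doctype;
    if ($doctype === null || $doctype->publicId === '') {
        return null; // no doctype, or one without a public ID (e.g. HTML5)
    }

    // Take the text between "//DTD " and the next "//".
    if (preg_match('~//DTD\s+(.*?)//~', $doctype->publicId, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

Returning null for the "no public ID" case leaves it to the caller to decide whether that means HTML5 or a missing doctype.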

Vyktor
// Case-insensitive substring test (public IDs are written in upper case, e.g. "XHTML").
function contains($haystack, $needle){
    return stripos($haystack, $needle) !== false;
}

$theDocType = "";
$stringWithHTML = ""; // load some HTML in here from somewhere

// Create DOM from HTML
$doc = new DOMDocument();
//@$doc->loadHTMLFile("just_a_file.html");
@$doc->loadHTML($stringWithHTML);

// Grab document type (doctype is null if nothing could be parsed)
$dtName = $doc->doctype ? $doc->doctype->name : "";
$dtPublic = $doc->doctype ? $doc->doctype->publicId : "";
if (strtolower($dtName) == "html" && $dtPublic != "") {
    // HTML or XHTML?
    $isXhtml = contains($dtPublic, "xhtml");
    $theDocType = $isXhtml ? "XHTML 1.0" : "HTML 4.01";
    // Which type?
    if (contains($dtPublic, "strict")) {
        $theDocType .= " (Strict)";
    } elseif (contains($dtPublic, "transitional")) {
        $theDocType .= " (Transitional)";
    } elseif (contains($dtPublic, "frameset")) {
        $theDocType .= " (Frameset)";
    } elseif ($isXhtml) {
        $theDocType = "XHTML 1.1"; // XHTML 1.1
    } else {
        // The HTML 4.01 Strict public ID carries no variant keyword
        $theDocType .= " (Strict)";
    }
} else {
    $theDocType = "HTML 5";
}

// Result
echo $theDocType;

This will output things like:
XHTML 1.1
HTML 5
HTML 4.01 (Strict)

JaredNinja

Try this

<?php
   $html = file_get_contents("http://google.com");
   // [^>]* also matches line breaks, so a doctype split across lines is still found
   $doctype = preg_match("/<!DOCTYPE[^>]*>/i", $html, $matches) ? $matches[0] : "";
?>
Bogdan Burym
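'#<!doctype.*?//dtd\s+([^/]*)//EN#i'

That should work as a pattern for your examples; the document type ends up in capture group 1. preg_match() needs the # delimiters, and the i flag is required because the examples write DOCTYPE and DTD in upper case. The pattern stops at //EN because the second example has no system identifier after it.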

TheNiceGuy

This regular expression extracts everything between "DTD " and "/" without any syntax checking:

.*DTD\s+([^/]+)

This regular expression extracts the document type and checks some syntax in the string:

<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>
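
To show the difference between the two, here is a small sketch that runs both patterns through preg_match(). The `~` delimiters, the `firstGroup` helper and the test strings are additions for illustration and are not part of the patterns themselves.

```php
<?php
// The two patterns above, wrapped in ~ delimiters for preg_match().
$loose  = '~.*DTD\s+([^/]+)~';
$strict = '~<!DOCTYPE\s+\w*\s*\w*\s*"[-//\w\d]*DTD\s+([\w\d\s.]*)[^"]*[^>]*>~';

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function firstGroup(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$doctype     = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">';
$notADoctype = 'see the DTD notes/section 2';

var_dump(firstGroup($loose, $doctype));      // the document type
var_dump(firstGroup($strict, $doctype));     // the document type
var_dump(firstGroup($loose, $notADoctype));  // still captures text after "DTD "
var_dump(firstGroup($strict, $notADoctype)); // NULL, not a doctype declaration
```

Both patterns are case-sensitive as written, so a lowercase `<!doctype` would need the `i` flag.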
CLSheppard

I used this thread in the past, but during testing I found a problem with some long doctypes. Sometimes the developer splits the doctype across 2 or 3 lines. In that case, a regular expression is not the best approach.

Here is an approach that handles doctypes on one or several lines:

<?php
class Doctype {
    var $html;
    var $doctype;
    var $version;
    function __construct($html){ // PHP 8 no longer treats a method named after the class as the constructor
       $this->html = $html;
       $this->extractDoctype();
       $this->processDoctype();
    }
    private function extractDoctype(){
        $preDoctype = "";
        $preDoctypeValid = false;
        $lines = explode(PHP_EOL, $this->html);
        foreach ($lines as $line) {
            $preDoctype = $preDoctype . $line;
            if(
                (strpos(strtolower($preDoctype), "<!doctype") !== false) && 
                (strpos(strtolower($preDoctype), ">") !== false)){
                $preDoctypeValid = true;
                break;
            }
        }
        if($preDoctypeValid){
            //Store only the pattern: <! doctype >
            $pos1 = strpos(strtolower($preDoctype), "<!doctype");
            $pos2 = strpos($preDoctype, ">", $pos1) + 1;
            // substr() takes a length, not an end position
            $preDoctype = substr($preDoctype, $pos1, $pos2 - $pos1);
        }else{
            $preDoctype = "";
        }
        $this->doctype = $preDoctype;
    }
    private function processDoctype(){
        $version = "";

        $pattern_html5 = "/<!doctype\s+?html\s?>/i";
        if (preg_match($pattern_html5, strtolower($this->doctype))) {
            $version = "HTML5";
        }else if(strpos(strtolower($this->doctype), "xhtml") !== false){
            $version = "XHTML";     
        }else if(strpos(strtolower($this->doctype), "html") !== false){
            if(strpos(strtolower($this->doctype), "3.2") !== false){
                $version = "HTML 3.2";  
            }
            if(strpos(strtolower($this->doctype), "4.01") !== false){
                $version = "HTML 4.01"; 
            }
            if(strpos(strtolower($this->doctype), "2.0") !== false){
                $version = "HTML 2.0";  
            }
        }else{
            $version = "OTHER";
        }
        $this->version = $version;
    }
    public function getDoctype(){
        return $this->doctype;
    }
    public function getDoctypeVersion(){
        return $this->version;
    }
}
?>

https://github.com/jabrena/WTAnalyzer/blob/master/r_php/document/Doctype.class.php
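
For comparison, the extraction step can also be sketched without splitting the input into lines. The function name `findDoctypeDeclaration` and the sample input are hypothetical. The sketch assumes the declaration contains no ">" before its closing one.

```php
<?php
// Hypothetical helper: returns the doctype declaration on a single line, or "".
function findDoctypeDeclaration(string $html): string
{
    $start = stripos($html, '<!doctype');
    if ($start === false) {
        return '';
    }
    $end = strpos($html, '>', $start);
    if ($end === false) {
        return '';
    }
    // substr() expects a length, so convert the end position into one.
    $declaration = substr($html, $start, $end - $start + 1);
    // Collapse line breaks and runs of whitespace into single spaces.
    return preg_replace('/\s+/', ' ', $declaration);
}

// A doctype split across three lines.
$html = "<!DOCTYPE html PUBLIC\n"
      . "  \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
      . "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
      . "<html></html>";

echo findDoctypeDeclaration($html);
```

Once the declaration is on one line, the version checks from processDoctype() can be applied to it unchanged.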

jabrena