Here is the problem:

I want to extract the title of a website. I have seen multiple implementations, but none of them handle sites with multiple <title> tags. So currently I'm using something like this to extract the first (true) title:

function GetTitleFromWebSite($url)
{
    $arrContextOptions=array(
        "ssl"=>array(
            "verify_peer"=>false,
            "verify_peer_name"=>false,
        ),
    );  

    $page = @file_get_contents($url, false, stream_context_create($arrContextOptions));
    if ( $page )
    {
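        // strpos() returns false when nothing is found; 0 is a valid offset
        $title_begin = strpos($page, "<title>");
        if ( $title_begin !== false )
        {
            $title_begin += 7; // skip past "<title>"
            // search from the opening tag so only the first title is matched
            $title_end = strpos( $page, "</title>", $title_begin );
            if ( $title_end !== false )
            {
                $title = htmlentities( substr($page, $title_begin, $title_end - $title_begin) );

                return $title;
            }
        }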
    }

    return "";
}

I know that this isn't secure, but it is only for testing and I will worry about certificates later.

The question is:

What is the best way of handling this? Is there something that will take care of every crazy construction? Some of the implementations I have seen handle a newline inside <title>. Is there any 'nice' way of doing this?

Derag
  • no web page should have multiple `title` tags - at least to do so renders them invalid. However, use `DOMDocument` to load the page and `getElementsByTagName` - then iterate through the collection – Professor Abronsius Oct 02 '18 at 20:35
  • But is there a way to skip certification in `DOMDocument`? – Derag Oct 02 '18 at 20:44
  • Assuming that `$page = @file_get_contents.....` returns the HTML, what need is there to `skip certification`? Load the HTML into the DOM object and use that to process the titles. – Professor Abronsius Oct 02 '18 at 20:46
  • Ah, you are right. Thanks – Derag Oct 02 '18 at 20:55

1 Answer

This is untested and assumes that you can actually fetch the HTML from the remote URL, but the following might lead you to a solution:

function GetTitleFromWebSite( $url ){
    $opts=array(
        'ssl'   =>  array(
            'verify_peer'       =>  false,
            'verify_peer_name'  =>  false,
        ),
    );

    $titles=array();

    $page = @file_get_contents($url, false, stream_context_create($opts));
    if ( $page ) {

        libxml_use_internal_errors( true );
        $dom=new DOMDocument;
        $dom->validateOnParse=false;
        $dom->standalone=true;
        $dom->preserveWhiteSpace=true;
        $dom->strictErrorChecking=false;
        $dom->recover=true;

        $dom->loadHTML( $page );
        libxml_clear_errors();


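        // collect the text of every <title> element, in document order
        $col=$dom->getElementsByTagName( 'title' );
        if( $col->length > 0 ){
            foreach( $col as $title ) $titles[]=$title->nodeValue;
        }
        return $titles;
    }
    // return an empty array on failure so the return type stays consistent
    return $titles;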
}
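
If only the first title is wanted, the same idea can be reduced to a small helper that works on HTML that has already been downloaded. This is a minimal sketch rather than a definitive implementation: the name `firstTitleFromHtml` is hypothetical, it assumes the first `<title>` in document order is the "true" one, and it collapses whitespace so titles that span several lines come back as a single line. Note that `loadHTML` may misread non-ASCII text when the page does not declare its encoding.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the text of the first <title> element, or '' if there is none.
function firstTitleFromHtml(string $html): string
{
    // loadHTML() rejects empty input, so bail out early
    if (trim($html) === '') {
        return '';
    }

    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = $dom->getElementsByTagName('title');
    if ($nodes->length === 0) {
        return '';
    }

    // Collapse newlines, tabs and repeated spaces into single spaces
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}
```

With `$page` fetched as in the function above, `firstTitleFromHtml($page)` gives the first title as a plain string. Escaping with `htmlentities` is left to the point where the title is actually printed.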
Professor Abronsius