Problems on getting a page's title with PHP

Question

I wrote this function in PHP to get a page's title. I know it might look a bit messy, but that's because I'm a beginner in PHP. I previously used `preg_match("/<title>(.+)<\/title>/i",$returned_content,$m)` inside the if, and it didn't work as I expected.

function get_page_title($url) {
    $returned_content = get_url_contents($url);
    $returned_content = str_replace("\n", "", $returned_content);
    $returned_content = str_replace("\r", "", $returned_content);
    $lower_rc = strtolower($returned_content);
    $pos1 = strpos($lower_rc, "<title>") + strlen("<title>");
    $pos2 = strpos($lower_rc, "</title>");
    if ($pos2 > $pos1)
        return substr($returned_content, $pos1, $pos2-$pos1);
    else
        return $url;
}

This is what I get when I try to get the titles of the following pages using the function above:

- http://www.google.com -> "302 Moved"
- http://www.facebook.com -> "http://www.facebook.com"
- http://www.revistabula.com/posts/listas/100-links-para-clicar-antes-de-morrer -> "http://www.revistabula.com/posts/listas/100-links-para-clicar-antes-de-morrer" (When I add a / to the end of the link, I can get the title successfully: "100 links para clicar antes de morrer | Revista Bula")

My questions are:

- I know Google is redirecting to my country's mirror when I try to access google.com, but how can I get the title of the page it redirects to?
- What is wrong in my function that makes it get the title of some pages, but not of others?

I have already accepted an answer. `get_url_contents()` returns the page's HTML code. — AntonioJunior, Aug 11 '11 at 23:41

score 5 · Accepted Answer · edited May 23 '17 at 09:59

HTTP clients should follow redirects. That 302 status code means that the content you tried to get isn't at that location, and the client should follow the Location: header to figure out where it is.

You have two problems here. The first is not following redirects. If you use cURL, you can get it to follow redirects by setting this:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

See this question for a full solution:

Make curl follow redirects?
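As a rough illustration of the redirect setting above, here is a minimal sketch of a fetch helper. The function name `fetch_html`, the redirect limit, and the timeouts are assumptions made for this example; they are not from the question or from the linked answer.

```php
<?php
// Hypothetical helper (not from the original post): fetch a page body with
// cURL, following redirects. Returns null if the request fails.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: headers on 3xx responses
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // assumed limit, guards against redirect loops
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // assumed timeouts, in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $body = curl_exec($ch); // string on success, false on failure

    return is_string($body) ? $body : null;
}
```

With redirects followed, a request for http://www.google.com should return the page it redirects to rather than the "302 Moved" stub. The trailing-slash case from the question may be a redirect as well, though that is a guess.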

The second problem is that you are parsing HTML with RegEx. Don't do that. See this question for better alternatives:

How do you parse and process HTML/XML in PHP?
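One of the alternatives is PHP's bundled DOM extension. The sketch below shows how a title could be read with `DOMDocument`; the helper name `extract_title` and the whitespace cleanup are assumptions for illustration, not something taken from the linked question.

```php
<?php
// Hypothetical helper: extract the <title> text from an HTML string using
// PHP's DOM extension. Returns null when no title element is found.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The HTML parser normalizes tag names, so <TITLE> is found as well.
    $node = $doc->getElementsByTagName('title')->item(0);
    if ($node === null) {
        return null;
    }

    // Collapse newlines and indentation that may surround the title text.
    return trim(preg_replace('/\s+/', ' ', $node->textContent));
}
```

It could be fed the string returned by the question's `get_url_contents($url)`, falling back to `$url` when the result is null. One caveat: `loadHTML()` assumes ISO-8859-1 when the page declares no charset, so titles with non-ASCII characters may need extra handling.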

Perfect answer! In the link you passed, "Best methods to parse HTML with PHP", I found "Simple HTML Dom Parser", and it solves my problem. — AntonioJunior, Aug 11 '11 at 23:19

score 0 · Answer 2 · answered Aug 11 '11 at 23:14

Why not try something like this? It works very well.

function get_page_title($url)
{
    $source = file_get_contents($url);

    $results = preg_match("/<title>(.*)<\/title>/", $source, $title_matches);
    if (!$results)
        return null;

    // get the first match, this is the title
    $title = $title_matches[1];
    return $title;
}

answered Aug 11 '11 at 23:14 by user783437

Parse HTML with a regex? [We don't do that here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – George Cummins Aug 11 '11 at 23:16
It's certainly possible and I find it works well for some situations. I've been working with PHP for years and don't know how to use a DOM Parser. Parsing HTML with regex seems to be a good alternative for a beginner. – user783437 Aug 11 '11 at 23:19
You are right: parsing HTML with a regex will work in _some_ situations. However, good code should handle all valid input, and a regular expression cannot do that when HTML is the input. Your regex will fail to return the expected results on this ugly but completely valid snippet of HTML: `Site X<![CDATA[]]>`. – George Cummins Aug 11 '11 at 23:30
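To make the point about valid input concrete, here is a small sketch that runs the pattern from the answer above against a few inputs. The sample strings are made up for this illustration and are not the snippet the comment refers to.

```php
<?php
// The pattern from the answer above, applied to made-up but valid inputs.
$pattern = "/<title>(.*)<\/title>/";

$samples = [
    'multi-line title' => "<title>\nSite X\n</title>",          // "." does not match newlines without the s flag
    'uppercase tag'    => "<TITLE>Site X</TITLE>",              // no i flag, so the match is case-sensitive
    'attribute on tag' => "<title id=\"main\">Site X</title>",  // the literal "<title>" is not present
    'plain title'      => "<title>Site X</title>",              // the only case the pattern handles
];

$results = [];
foreach ($samples as $label => $html) {
    // preg_match() returns 1 on a match and 0 when the pattern does not match.
    $results[$label] = preg_match($pattern, $html);
    echo $label, ': ', $results[$label] ? 'matched' : 'no match', PHP_EOL;
}
```

Only the last sample matches. Adding the `i` and `s` modifiers would fix the first two cases, but each fix covers one more variation rather than HTML in general.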
