I'm trying to get the title of a website whose URL is entered by the user.

The URL comes from a text input and is sent to the server via AJAX. The user can enter anything: an existing link, a single word, or something weird like 'po392#*@8'.

Here is a part of my PHP script:

         // Make sure the url is on another host
        if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
            $url = "http://".$url;
        }

        // Extra confirmation for security
        if (filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)) {
            $urlIsValid = "1";
        } else {
            $urlIsValid = "0";
        }

        // Make sure there is a dot in the url
        if (strpos($url, '.') !== false) {
            $urlIsValid = "1";
        } else {
            $urlIsValid = "0";
        }

        // Retrieve title if no title is entered
        if($title == "" AND $urlIsValid == "1") {

            function get_http_response_code($theURL) {
                $headers = get_headers($theURL);
                if($headers) {
                    return substr($headers[0], 9, 3);
                } else {
                    return 'error';
                }
            }

            if(get_http_response_code($url) != "200") {

                $urlIsValid = "0";

            } else {

                $file = file_get_contents($url);

                $res = preg_match("/<title>(.*)<\/title>/siU", $file, $title_matches);

                if($res === 1) {
                    $title = preg_replace('/\s+/', ' ', $title_matches[1]);
                    $title = trim($title);

                    $title = addslashes($title);
                }

                // If title is still empty, make title the url
                if($title == "") {
                    $title = $url;
                }

            }
        }

However, errors still occur in this script.

It works perfectly when an existing URL such as 'https://www.youtube.com/watch?v=eB1HfI-nIRg' is entered, and when a non-existing page such as 'https://www.youtube.com/watch?v=NON-EXISTING' is entered. It doesn't work when the user enters something like 'twitter.com' (without http) or something like 'yikes'.

I have tried literally everything: cURL, DOMDocument...

The problem is that when an invalid link is entered, the AJAX call never completes (it keeps loading), whereas the script should set $urlIsValid = "0" whenever an error occurs.

I hope someone can help me. It's appreciated.

Nathan

  • anything against `true` and `false`? – Pedro Lobito Apr 26 '17 at 18:32
  • maybe `preg_match` "screams" when `$file` is `false` and shows a warning, so the (possible) AJAX response is no longer valid JSON, which causes a JS error, and the loading never stops? – Constantin Galbenu Apr 26 '17 at 18:38
  • @PedroLobito I prefer to return strings in ajax calls, but yeah you could just read the '0' as false and the '1' as true. I'm learning. – Nathan Apr 26 '17 at 18:40
  • @ConstantinGALBENU Awesome! That fixed some cases. However now the problem is that as you can see in the code I add 'HTTP://' if the transfer protocol is missing. But for example twitter.com is on HTTPS://, and now it only works for HTTP:// links and not for HTTPS:// links. If I enter twitter.com it doesn't work, but it does work on for example [link](http://www.webopedia.com/TERM/H/HTTP.html). – Nathan Apr 26 '17 at 18:48

1 Answer

You have a relatively simple problem but your solution is too complex and also buggy.

These are the problems that I've identified with your code:

// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
     $url = "http://".$url;
}

This does not ensure that the URL is on another host (it could be localhost). You should remove this code.

// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
        $urlIsValid = "1";
} else {
        $urlIsValid = "0";
}

This code overwrites the result of the check above it, where you validate that the string is indeed a valid URL, so remove it.
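If a dot check is still wanted, it can be combined with the URL validation instead of replacing its result. The following is only a sketch under stated assumptions: `isPlausibleWebUrl` is a hypothetical helper name, and restricting the scheme to http/https is an extra assumption that is not in the original script. Note that `FILTER_VALIDATE_URL` rejects strings without a scheme, such as `twitter.com`.

```php
<?php
// Hypothetical helper (not from the original script): every check must pass,
// so a later check can no longer overwrite the result of an earlier one.
function isPlausibleWebUrl(string $url): bool
{
    // Syntactic validation; fails for scheme-less input such as "twitter.com".
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host   = (string) parse_url($url, PHP_URL_HOST);

    // Assumption: only http(s) URLs whose host contains a dot are of interest.
    return in_array($scheme, ['http', 'https'], true)
        && strpos($host, '.') !== false;
}

var_dump(isPlausibleWebUrl('https://www.youtube.com/watch?v=eB1HfI-nIRg'));
var_dump(isPlausibleWebUrl('twitter.com'));
var_dump(isPlausibleWebUrl('http://yikes'));
```

Applying the dot check to the parsed host rather than the whole string avoids accepting input where the only dot appears in the path or query.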

Defining the additional function get_http_response_code is unnecessary. You can use file_get_contents alone to get the HTML of the remote page and compare its return value against false to detect an error.

Also, from your code I conclude that if the variable $title (defined outside this snippet) is not empty, you don't fetch anything, so why not check it first?
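As a rough illustration of checking the return value against false, the sketch below wraps `file_get_contents` in a hypothetical helper, `fetchBodyOrNull`. The timeout, redirect limit, and user agent string are assumptions chosen for the example, not values from the original script.

```php
<?php
// Hypothetical helper: returns the response body, or null on any failure.
function fetchBodyOrNull(string $url, float $timeoutSeconds = 5.0): ?string
{
    // The "http" context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'         => $timeoutSeconds, // give up instead of hanging
            'follow_location' => 1,               // follow redirects
            'max_redirects'   => 5,
            'user_agent'      => 'title-fetcher-sketch/0.1', // assumed value
        ],
    ]);

    // @ suppresses the warning emitted on failure; the false return value
    // is what signals the error.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

A caller can then set `$urlIsValid = "0"` whenever the helper returns null, and the explicit timeout bounds how long the AJAX request can stay pending because of a slow remote host.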

To sum it up, your code should look something like this:

if('' === $title && filter_var($url, FILTER_VALIDATE_URL))
{
    // getContentsFromUrl() returns false on failure; inside it, @ suppresses
    // warnings as we won't need them (this could also be done with
    // error_reporting(0) or a similar side-effect method)
    $html = getContentsFromUrl($url);

    if(false !== $html && preg_match("/<title>(.*)<\/title>/siU", $html, $title_matches))
    {
        $title = preg_replace('/\s+/', ' ', $title_matches[1]);
        $title = trim($title);
        $title = addslashes($title);
    }

    // If title is still empty, make title the url
    if($title == "") {
        $title = $url;
    }
}

function getContentsFromUrl($url)
{
   //if not full/complete url
   if(!preg_match('#^https?://#ims', $url))
   {
       $completeUrl = 'http://' . $url;
       $result = @file_get_contents($completeUrl);
       if(false !== $result)
       {
           return $result;
       }

       //we try with https://
       $url = 'https://' . $url;
   }

   return @file_get_contents($url);
}
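
Since the thread traces the hanging request to the response the client receives, the title extraction and the response can also be sketched as two small pure functions. This is only an illustration: `extractTitle` and `buildTitleResponse` are hypothetical names, the response keys are assumptions, and `json_encode` is used here in place of `addslashes` because it handles the escaping of the string for JSON output.

```php
<?php
// Hypothetical helper: returns the cleaned-up <title> text, or null if none.
function extractTitle(string $html): ?string
{
    // Non-greedy match; "s" lets "." span newlines, "i" ignores case.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches) !== 1) {
        return null;
    }

    $title = html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $title = trim(preg_replace('/\s+/', ' ', $title)); // collapse whitespace

    return $title === '' ? null : $title;
}

// Hypothetical helper: always produces valid JSON, even when the fetch failed
// ($html is null) or the page has no title (the URL is used as the title).
function buildTitleResponse(string $url, ?string $html): string
{
    $title = $html === null ? null : extractTitle($html);

    return json_encode([
        'urlIsValid' => $html === null ? '0' : '1', // assumed response shape
        'title'      => $title ?? $url,
    ]);
}

echo buildTitleResponse('http://example.com', '<title>A &amp; B</title>'), "\n";
```

Keeping the fetch separate from the parsing means the parsing can be exercised with plain strings, without any network access.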
Constantin Galbenu
  • Thanks! I tried that before but I kept trying other things and this was what I ended up with. It still doesn't work if you enter `twitter.com` because Twitter is on `https://` (and with `http://twitter.com`, file_get_contents will fail). Can you help me with that? Also see my other comment :-) ... oh and you probably forgot PHP uses `AND` instead of `&&` – Nathan Apr 26 '17 at 19:20
  • @Nathan PHP uses both `AND` and `&&` but they have slightly different meanings, see http://stackoverflow.com/questions/4502092/php-and-or-keywords – Constantin Galbenu Apr 26 '17 at 19:29
  • I guess You may use php cUrl library if Twitter validates HTTP headers/user-agents – ad4s Apr 26 '17 at 19:47
  • @Constantin Thank you for updating your answer and I learned something new! (Now I'm wondering if it's bad that I only used AND/OR in my scripts) – Nathan Apr 26 '17 at 19:50
  • @ConstantinGALBENU But back to my question... I tried it, but it still doesn't work. If no title is entered, it just keeps loading. Can you check out my script? http://codepad.org/ufEBd2KI – Nathan Apr 26 '17 at 19:51
  • Use developer tools to see what you get from the server and any other JavaScript errors – Constantin Galbenu Apr 26 '17 at 19:54
  • @ConstantinGALBENU I'm not getting JS errors, I already checked that. And the script works if no title is entered. What dev tool would you recommend to see PHP errors? – Nathan Apr 26 '17 at 19:56
  • But what contents do the client receive from the server, is it valid json? See the `network` tab in Dev tools – Constantin Galbenu Apr 26 '17 at 20:00
  • Well, you also learned me something about the Dev Tools! Man I wish I figured that out earlier! There was an error... Here: `if(false !== $html && preg_match("/<title>(.*)<\/title>/siU", $file, $title_matches))` `$file` had to be `$html`... But it's all working now! Thank you so much! – Nathan Apr 26 '17 at 20:10
  • Glad to help. Learning is the most important thing for us, the devs! – Constantin Galbenu Apr 26 '17 at 20:11
  • @ConstantinGALBENU Eh... this isn't [english.stackexchange.com](https://english.stackexchange.com) ;-) Just kidding, have a nice day!! – Nathan Apr 26 '17 at 20:35
  • You too! Here is 23:36 though :) – Constantin Galbenu Apr 26 '17 at 20:37
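
On the `AND` versus `&&` point raised in the comments, the difference is operator precedence: `&&` binds more tightly than `=`, while `AND` binds more loosely. A minimal demonstration follows (the variable names are only for illustration):

```php
<?php
// Parsed as: $withSymbol = (true && false)
$withSymbol = true && false;

// Parsed as: ($withKeyword = true) AND false
$withKeyword = true AND false;

var_dump($withSymbol);  // bool(false)
var_dump($withKeyword); // bool(true)
```

Inside a plain `if (...)` condition, as in the question's script, both forms behave the same, because comparison operators such as `==` bind more tightly than either of them. The difference only shows up when the expression is combined with assignment or other low-precedence operators.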