Technique for reading a link content and parsing it

Question

What is the best practice for implementing a Google+-like or Facebook-like link sharer, where a user pastes a link into a textarea and the page fetches the content of that link and extracts the page title, a sample of text and an image?

This question is PHP/jQuery related. Thanks.

Not quite sure what you mean by "an entity pastes a link on a textarea" here. Are you just trying to get the title/description/etc of a URL via PHP? — Nick, Nov 23 '11 at 16:46
yep. `file_get_contents` seems to work fine but do you think that's the best approach to this? — threepoint1416, Nov 23 '11 at 16:51

Roman · Answer 1 · 2011-11-23T17:05:31.250

Without going into much detail:

On the client side, you monitor changes to the textbox and look for anything that looks like a URL. When you find one, you send it to the server through an Ajax call.
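
The detection step could be sketched roughly as follows. This is only an illustration: `findFirstUrl` is a hypothetical name, the pattern is deliberately loose and only recognises links that start with http:// or https://, and the textarea id in the commented wiring is an assumption.

```javascript
// Hypothetical helper: return the first http(s) URL found in the text, or null.
function findFirstUrl(text) {
    // loose pattern: the scheme, then everything up to whitespace, a quote or an angle bracket
    var match = text.match(/https?:\/\/[^\s<>"']+/i);
    if (!match) {
        return null;
    }
    // strip trailing punctuation that usually belongs to the sentence, not the URL
    return match[0].replace(/[.,;:!?)]+$/, '');
}

// Possible wiring in the browser, assuming a textarea with id "status":
// $('#status').bind('input', function () {
//     var url = findFirstUrl($(this).val());
//     if (url) { /* send it to the server with $.ajax */ }
// });
```

In practice you would probably also remember the last URL you sent, so the same link is not fetched again on every keystroke.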

The server opens the remote URL and parses the response. You can then use the parser to look for the page title. (You might want to check the MIME type first, before trying to download and parse a linked ".pdf" or ".mov" file.)
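
The MIME type check could look roughly like this. It is written in JavaScript only to match the jQuery code used elsewhere in this thread; on a PHP server the same comparison would be applied to the Content-Type header of the remote response. `isHtmlContentType` is a hypothetical name, and treating only text/html and application/xhtml+xml as parseable is an assumption.

```javascript
// Hypothetical helper: decide from a Content-Type header value whether the
// response is an HTML page worth parsing.
function isHtmlContentType(headerValue) {
    if (!headerValue) {
        return false;
    }
    // drop parameters such as "; charset=utf-8" and normalise the case
    var mime = headerValue.split(';')[0].trim().toLowerCase();
    return mime === 'text/html' || mime === 'application/xhtml+xml';
}
```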

The server responds to the Ajax call with the requested details (the page title, or an error message).

You need to go through your own server because the browser's same-origin policy stops a script on your page from reading content from another domain.

More complicated systems would look for semantic meta annotations (like schema.org, microformats or Facebook's Open Graph) and interpret those to find relevant images, descriptions or videos.
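
As a rough illustration of the Open Graph case, the tags can be collected into a plain object. This sketch uses a regex rather than a real HTML parser, and it assumes the `property` attribute comes directly before `content` and that values are quoted; `extractOpenGraph` is a hypothetical name.

```javascript
// Hypothetical helper: collect <meta property="og:..." content="..."> pairs.
function extractOpenGraph(html) {
    var result = {};
    // group 1: key after "og:", group 2: the opening quote, group 3: the value
    var re = /<meta\s+property=["']og:([^"']+)["']\s+content=(["'])([\s\S]*?)\2/gi;
    var m;
    while ((m = re.exec(html)) !== null) {
        // keep the first value seen for each key (og:image may appear several times)
        if (!Object.prototype.hasOwnProperty.call(result, m[1])) {
            result[m[1]] = m[3];
        }
    }
    return result;
}
```

A page that sets og:title, og:description and og:image would then give you all three pieces the question asks for; anything missing can fall back to the title tag and the meta description.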

score 0 · Answer 2 · answered Nov 23 '11 at 16:56

Have a look into get_meta_tags (specifically the function in the comment this links to - as it'll get the content of the title tag as well).

I'm not sure if this is the best solution, and I know a lot of people are against using regex to parse HTML like this function is doing to get the title tag, but it seems to work pretty well when I've used it.

This can also easily get the og: metadata that Facebook uses (if it's been set on the URL you're trying to parse).

score 0 · Answer 3 · answered Nov 23 '11 at 18:28

Regardless of how you want to approach this, you have to work around the same-origin policy. Perhaps the easiest way is to put a simple PHP script on your server that fetches a URL and returns its contents.

Depending on where you want to do the work (i.e. what language you feel comfortable in), you can take a client approach to parsing or a server approach.

CLIENT PARSING STRATEGY

If you want to do the work in jQuery, your simple PHP script will look something like this:

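<?php

   // you could do this with curl too, plenty of tutorials on that topic
   $url = isset($_GET['url']) ? $_GET['url'] : '';
   // minimal sanitizing: only accept http(s) URLs, so the script cannot be
   // used to read local files (todo: also reject internal/private hosts)
   if (!preg_match('@^https?://@i', $url)) {
      header('HTTP/1.1 400 Bad Request');
      exit;
   }
   print file_get_contents($url);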

Then you would parse the result client side like so:

jQuery(function($) {
   // given an html response, extract the title
   function getTitle(data) {
       // match() returns null when there is no <title>, so guard before indexing
       var matches = data.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
       return matches ? $.trim(matches[1]) : '';
   }

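   // find the body of an html response
   // because browsers parse the innerHtml differently
   // (http://stackoverflow.com/questions/2488839/does-jquery-strip-some-html-elements-from-a-string-when-using-html)
   // we can't rely on just $(data) to do this right
   function getBody(data) {
      // [\s\S] also matches newlines; [^>]* allows attributes on <body>
      var matches = data.match(/<body[^>]*>([\s\S]*)<\/body>/i);
      // wrap in a <div> so a text-only body is not treated as a selector
      return $('<div>').html(matches ? matches[1] : '');
   }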

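   // given an html response, extract a description
   function getDesc(data) {
       var $data = $(data);
       // when a full page is parsed, <meta> usually ends up at the top level
       // of the set, where find() does not look, so check with filter() too
       var selector = 'meta[name=description]';
       var $match = $data.filter(selector).add($data.find(selector));
       if ($match.length) {
           return $match.attr('content');
       }
       var $body = getBody(data);
       return $body.text().substring(0, 255).replace(/\n/g, ' ');
   }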

   // this url points to a proxy (PHP) script on your server
   // which does a curl or similar operation to retrieve the
   // url's contents
   $.ajax('/php_fetch_url.php', {
       data: {
           url: 'http://www.somedomain.to/fetch/'
       },
       success: function(data, status, xhr) {
           // assumes your debugger console (e.g. Firebug) is opened!
           console.log(data);
           console.log(status);
           console.log(xhr);

           console.log('title='+getTitle(data));
           console.log('desc='+getDesc(data));
       },
       type: 'GET',
       error: function(xhr, status, err) {
           console.log(status);
           console.log(err);
       },
       dataType: 'text'
   });
});
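
The fallback in getDesc above cuts the body text at exactly 255 characters, which can split a word in the middle. A possible refinement, sketched here under the assumption that you want to break on a space and mark the cut; `sampleText` is a hypothetical name, not part of the code above.

```javascript
// Hypothetical helper: collapse whitespace and cut the text at a word boundary.
function sampleText(text, maxLength) {
    var clean = text.replace(/\s+/g, ' ').trim();
    if (clean.length <= maxLength) {
        return clean;
    }
    var cut = clean.substring(0, maxLength);
    var lastSpace = cut.lastIndexOf(' ');
    // when there is no space to break on, keep the hard cut
    if (lastSpace > 0) {
        cut = cut.substring(0, lastSpace);
    }
    return cut + '...';
}
```

Inside getDesc this would be used as `sampleText($body.text(), 255)` in place of the substring/replace chain.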

SERVER PARSING STRATEGY

If you are more comfortable in PHP, or want the more efficient and secure approach (the browser then receives only a small JSON object instead of the whole remote page), you can do the parsing in PHP and return a JSON object. Your PHP script will look something like this:

<?php

   function fetchContent($url) {
      //todo: sanitize $url!
      return file_get_contents($url);
   }

   function fetchTitle($content) {
      // "i" ignores case; [^>]* allows attributes on <title>
      preg_match('@<title[^>]*>([^<]+)</title>@i', $content, $matches);
      return count($matches) > 1? $matches[1] : '';  
   }

   function fetchBody($content) {
      // "s" lets . match newlines; [^>]* allows attributes on <body>
      return preg_replace('@.*<body[^>]*>(.*)</body>.*@is', "\\1", $content);
   }

   function fetchDesc($content) {
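      preg_match('@<meta\s+name=[\'"]description[\'"]\s+content=[\'"]([^\'"]+)[\'"]@i', $content, $matches);
      if( count($matches) > 1 ) { return $matches[1]; }
      // no meta description: fall back to the first 255 characters of body text
      $body = fetchBody($content);
      return substr(preg_replace('/\s+/', ' ', trim(strip_tags($body))), 0, 255);
   }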

   $content = fetchContent($_GET['url']);

   // you may need to install json
   // http://us.php.net/json
   print json_encode( array("title" => fetchTitle($content), "description" => fetchDesc($content)) );

And then your js code will look something like this:

jQuery(function($) {
   $.ajax('/php_fetch_url.php', {
       // A CRUCIAL CHANGE!
       dataType: 'json',
       data: {
           url: 'http://www.somedomain.to/fetch/'
       },
       success: function(data, status, xhr) {
           // assumes your debugger console (e.g. Firebug) is opened!
           console.log('title='+data.title);
           console.log('desc='+data.description);
       },
       type: 'GET',
       error: function(xhr, status, err) {
           console.log(status);
           console.log(err);
       }
   });
});
