
On Facebook, when you add a link to your wall, it picks up the title, pictures and part of the text. I've seen this behavior on other websites where you can add links. How does it work? Does it have a name? Is there any JavaScript/jQuery extension that implements it?

And how is it possible that Facebook goes to another website and gets the HTML when it is, supposedly, forbidden to make a cross-site AJAX call?

Thanks.

vtortola
  • Maybe this will be helpful info: http://stackoverflow.com/questions/680562/can-javascript-read-the-source-of-any-web-page – Musa Apr 21 '16 at 06:42

5 Answers


Basic Methodology

When the fetch event is triggered (for example, on Facebook, pasting a URL in) you can use AJAX to request the URL*, then parse the returned data as you wish.

Parsing the data is the tricky bit, because so many websites have varying standards. Taking the text between the title tags is a good start, along with possibly searching for a META description (but these are being used less and less as search engines evolve into more sophisticated content-based searches).

Failing that, you need some way of finding the most important text on the page and taking the first 100 chars or so as well as finding the most prominent picture on the page.

This is not a trivial task; it is extremely complicated to derive semantics from such a fluid and contrasting set of data (a generic returned web page). For example, you might find the biggest image on the page. That's a good start, but how do you know it's not a background image? How do you know it's the image that best describes that page?

Good luck!

*If you can't directly AJAX third-party URLs, this can be done by requesting a page on your local server which fetches the remote page server side with some sort of HTTP request.
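A minimal sketch of that local proxy in PHP (the fetch.php name and the url parameter are purely illustrative assumptions):

<?php
// fetch.php - hypothetical proxy endpoint: the client AJAXes this page
// and the server fetches the remote page on its behalf.
$url = isset($_GET['url']) ? $_GET['url'] : '';

// Only allow http/https so the proxy can't be pointed at local files.
if (!preg_match('#^https?://#i', $url)) {
    http_response_code(400);
    exit('Invalid URL');
}

// Fetch the remote page server side and hand the HTML back to the caller.
echo file_get_contents($url);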

Some Extra Thoughts

If you grab an image from a remote server and 'hotlink' it on your site, many sites serve an 'anti-hotlinking' replacement image when you try to display it, so it might be worth comparing the image requested by your server-side page with the image actually fetched, so you don't show anything nasty by accident.

A lot of title tags in the head will be generic and non-descriptive. It would be better to fetch the title of the article (assuming an article-type site) if one is available, as it will be more descriptive; finding this is difficult though!

If you are really smart, you might be able to piggyback off Google, for example (check their T&C though). If a user requests a certain URL, you can Google-search it behind the scenes and use the returned Google descriptive text as your return text. If Google changes their markup significantly, though, this could break very quickly!

Tom Gullen

There are several APIs that can provide this functionality. For example, PageMunch lets you pass in a URL and callback so that you can do this from the client side or feed it through your own server:

http://www.pagemunch.com

An example response for the BBC website looks like:

{
  "inLanguage": "en",
  "schema": "http:\/\/schema.org\/WebPage",
  "type": "WebPage",
  "url": "http:\/\/www.bbc.co.uk\/",
  "name": "BBC - Homepage",
  "description": "Breaking news, sport, TV, radio and a whole lot more. The BBC informs, educates and entertains - wherever you are, whatever your age.",
  "image": "http:\/\/static.bbci.co.uk\/wwhomepage-3.5\/1.0.64\/img\/iphone.png",
  "keywords": [
    "BBC",
    "bbc.co.uk",
    "bbc.com",
    "Search",
    "British Broadcasting Corporation",
    "BBC iPlayer",
    "BBCi"
  ],
  "dateAccessed": "2013-02-11T23:25:40+00:00"
}
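If you feed a response like that through your own server, consuming it is straightforward. A rough PHP sketch (assuming $json already holds a metadata response shaped like the example above):

<?php
// $json is assumed to hold a metadata response like the BBC example above.
$page = json_decode($json, true);

// Pull out the fields needed for a link preview.
$title       = isset($page['name']) ? $page['name'] : '';
$description = isset($page['description']) ? $page['description'] : '';
$image       = isset($page['image']) ? $page['image'] : '';

printf("%s\n%s\n%s\n", $title, $description, $image);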
Tom

You can use a PHP server-side script to fetch the contents of any web page (look up web scraping). What Facebook does is make an AJAX call to a PHP server-side script, which uses the PHP function

file_get_contents('http://somesite.com.au'); 

Now, once the file or web page has been pulled into your server-side script, you can filter the contents for anything in particular, e.g. Facebook's link fetcher will look for the title, img and meta property="description" parts of the file or web page via regular expressions,

e.g. PHP's

preg_match() function.

These can be collected and then returned back to your web page.
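A rough sketch of that approach (regular expressions are brittle against real-world HTML, so treat this purely as an illustration):

<?php
// Fetch the remote page server side.
$html = file_get_contents('http://somesite.com.au');

// Grab the contents of the <title> tag.
$title = '';
if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $m)) {
    $title = trim($m[1]);
}

// Grab the meta description, if the page defines one (assumes the name
// attribute appears before content, which is common but not guaranteed).
$description = '';
if (preg_match('#<meta[^>]+name=["\']description["\'][^>]*content=["\'](.*?)["\']#is', $html, $m)) {
    $description = trim($m[1]);
}

echo $title . "\n" . $description;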

You may also want to consider adding extra functions for returning the data you want, as scraping some pages may take longer than expected to return your desired information, e.g. filter out irrelevant stuff like JavaScript, CSS, irrelevant tags, huge images, etc. to make it run faster.
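For instance, stripping script and style blocks before any further parsing is one cheap way to cut the noise down (again just a sketch, assuming $html holds the fetched page):

<?php
// Remove <script> and <style> blocks so later parsing only sees real content.
$clean = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
$clean = preg_replace('#<style\b[^>]*>.*?</style>#is', '', $clean);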

If you get this down pat you could potentially be on your way to building a web search engine or, better yet, collecting data off sites like Yellow Pages, e.g. phone numbers, mailing addresses, etc.

Also you may want to look further into:

get_meta_tags('http://somesite.com.au');
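get_meta_tags() fetches the page and returns its meta tags as an associative array keyed by each tag's name attribute, for example:

<?php
// Returns an associative array of the page's meta tags, keyed by name.
$tags = get_meta_tags('http://somesite.com.au');

// Typical keys, when the page defines them:
echo isset($tags['description']) ? $tags['description'] : 'no description';
echo "\n";
echo isset($tags['keywords']) ? $tags['keywords'] : 'no keywords';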

:-)

Steve C
  • Actually I use C#, but I think it will do the trick. I'll start with what you say about title, meta and imgs, and afterwards I'll try to do something more complex. Thanks a lot! – vtortola Jan 24 '11 at 18:15

You can always just look at what is in the title tag. If you need this in JavaScript it shouldn't be that hard. Once you have the data you can do:

var title = $(data).find('title').html();

The problem will be getting the data, since I think most browsers will block you from making cross-site AJAX requests. You can get around this by having a service on your site which will act as a proxy and make the request for you. However, at that point you might as well parse out the title on the server. Since you didn't specify what your back-end language is, I won't bother to guess now.
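Purely as an illustration of that server-side parse (shown in PHP here only because the other answers use it; any back end works the same way), an HTML parser such as DOMDocument is sturdier than regular expressions:

<?php
// Fetch the remote page and parse it with a real HTML parser instead of regex.
$html = file_get_contents('http://example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // silence warnings from malformed real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

// Read the <title> element, if the page has one.
$titles = $doc->getElementsByTagName('title');
$title  = $titles->length > 0 ? trim($titles->item(0)->textContent) : '';

echo $title;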

tster
  • Good point, I totally forgot about that. How is it possible to get another website's page from your page? How does Facebook do it? – vtortola Jan 24 '11 at 12:26
  • Facebook is written in PHP, so they send an AJAX request to "their own" PHP code that in turn sends a request to the other website. I'm sure you'll find that "proxy page" if you look deep enough. :) – Shadow The GPT Wizard Jan 24 '11 at 12:29
  • I see. I was trying to avoid that step, but it seems it will have to be done haha, thanks! – vtortola Jan 24 '11 at 13:00

It's not possible with pure JavaScript due to the cross-domain policy - a client-side script can't read the contents of pages on other domains unless that other domain explicitly exposes a JSON service.

The trick is to send a server-side request (each server-side language has its own tools), parse the results using regular expressions or some other string-parsing technique, then use that server-side code as a "proxy" for the AJAX call made "on the fly" when posting the link.
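In PHP, for instance, that server-side request could be made with cURL; a hedged sketch (the proxy.php name and the url parameter are just for illustration):

<?php
// proxy.php - hypothetical endpoint the AJAX call hits; it fetches the
// remote page with cURL and returns the HTML to the client for parsing.
$url = isset($_GET['url']) ? $_GET['url'] : '';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // don't hang on slow sites
$html = curl_exec($ch);
curl_close($ch);

echo $html !== false ? $html : '';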

Shadow The GPT Wizard