
So what I want to mimic is the link-share feature Facebook provides. You simply enter the URL, and then FB automatically fetches an image, the title, and a short description from the target website. How would one program this in JavaScript with Node.js and whatever other JavaScript libraries may be required? I found an example using PHP's fopen function, but I'd rather not include PHP in this project.

Is what I'm asking an example of web scraping? Do I just need to retrieve the data from inside the meta tags of the target website, and then also get the image tags using CSS selectors?

If someone can point me in the right direction, that'd be greatly appreciated. Thanks!

Ernesto11

2 Answers


Look at this post. It discusses scraping with Node.js. Here you have lots of previous info on scraping with JavaScript and jQuery.

That said, Facebook doesn't actually guess what the title, description, and preview are; they (at least most of the time) get that info from meta tags present on sites that want to be more accessible to Facebook users.

Maybe you could make use of that existing metadata to pull titles, descriptions, and image previews. The docs on the available metadata are here.
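For example, here is a crude sketch of reading Open Graph tags out of already-fetched HTML. It uses only a regex and plain Node (a real HTML parser would be far more robust), and the sample markup is made up for illustration:

```javascript
// Naive Open Graph extraction from an already-fetched HTML string.
// A regex is fragile against real-world markup; this only shows the idea.
function extractOpenGraph(html) {
  var og = {};
  var re = /<meta\s+property="og:([^"]+)"\s+content="([^"]*)"/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    og[m[1]] = m[2];
  }
  return og;
}

var sample =
  '<head>' +
  '<meta property="og:title" content="Example Page"/>' +
  '<meta property="og:image" content="http://example.com/pic.jpg"/>' +
  '</head>';

console.log(extractOpenGraph(sample).title); // "Example Page"
```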

jcane86
  • Yes, this will help if the page has Open Graph metadata. In other cases we need to use some heuristics. – AppleGrew Apr 14 '11 at 09:37
  • Yep, thanks both for the answers. I was looking more for a detailed plan on how to approach it, as I've already read through all the tutorials and guides listed — I did do my research. I learn best by looking through example code. Can someone link me to some code that has done something similar with JavaScript? How do you scrape through HTML tags on a page? – Ernesto11 Apr 15 '11 at 11:37
  • In addition, I was wondering what the difference is between screen scraping and web scraping. Is AJAX required for what I'm trying to do? I've read a lot of posts mentioning that AJAX cannot retrieve data from another domain's website because of security issues. – Ernesto11 Apr 15 '11 at 11:44
  • @Ern I haven't actually done scraping before, sorry but no code to share on that. About AJAX, it's not required, instead, as you said, it isn't even possible to make cross site requests through AJAX. So you should do all the scraping server-side. You can always add an AJAX layer in your presentation for usability purposes, (i.e. AJAX request to your server-side code to do the scraping, retrieve the results and render them) but the actual scraping would always be done server-side. – jcane86 Apr 15 '11 at 18:19

Yes, web scraping is required, and that's the easy part. The hard part is the generic algorithm to find the headings and the relevant text and images.

How to scrape

You can use jsdom to download the page and create a DOM structure on your server, then scrape it using jQuery on your server. You can find a good tutorial at blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs, as suggested by @generalhenry above.
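As a rough sketch of that approach (assuming the jsdom package is installed via npm; `jsdom.env` took a URL, a list of scripts to inject, and a callback in the versions of that era — the API has changed substantially since, so check the current jsdom docs; the URL below is a placeholder):

```
var jsdom = require('jsdom');

// Download the page, inject jQuery into its DOM, and scrape server-side.
jsdom.env('http://example.com/', ['http://code.jquery.com/jquery.js'],
  function (errors, window) {
    var $ = window.$;
    console.log($('title').text());
  });
```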

What to scrape

I guess a good way to find the heading would be:

var h;
// Try h1 first, then h2, and so on down to h6.
for (var i = 1; i <= 6; i++) {
    var el = $('h' + i).first();
    if (el.length) { // a jQuery object is always truthy; check its length
        h = el;
        break;
    }
}

Now h will hold the first heading element, or stay undefined if none is found. The alternative to this could be to simply get the page's title tag. :)
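That title-tag fallback can be sketched with a regex over the raw HTML (the helper name is hypothetical, not from the answer):

```javascript
// Fallback: pull the <title> tag directly from the raw HTML string.
function extractTitle(html) {
  var m = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].trim() : undefined;
}

console.log(extractTitle('<head><title> My Page </title></head>')); // "My Page"
```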

As for the images: list all, or the first few, images on the page that are reasonably large, so as to filter out sprites used for buttons, arrows, etc.
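One cheap heuristic for "reasonably large", assuming the markup declares width/height attributes (images without them would have to be downloaded and measured; the threshold and function name below are made up):

```javascript
// Heuristic image filter: keep <img> tags whose declared width and
// height attributes are both at least minSize pixels, to skip sprites.
function largeImages(html, minSize) {
  var imgs = html.match(/<img\b[^>]*>/gi) || [];
  return imgs.filter(function (tag) {
    var w = /width="(\d+)"/.exec(tag);
    var h = /height="(\d+)"/.exec(tag);
    return w && h && +w[1] >= minSize && +h[1] >= minSize;
  });
}

var page =
  '<img src="arrow.png" width="16" height="16">' +
  '<img src="photo.jpg" width="400" height="300">';

console.log(largeImages(page, 100).length); // 1
```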

And while fetching the remote data, make sure that the ProcessExternalResources flag is off. This will ensure that script tags for ads do not pollute the fetched page.

And yes, the relevant text would be in the tags following h.

AppleGrew
  • Thanks for the response. So why does the following code not work to get the metadata? – Ernesto11 Apr 16 '11 at 00:51
  • var meta = $('meta[name="description"]').attr("content"); Doesn't that read the meta element inside the $() jQuery function? – Ernesto11 Apr 16 '11 at 00:53