
I will have a widget on a remote page. In the widget I want JavaScript or jQuery to get all the article content from the webpage and send it back to my website. I need only the article content, not the other information on the page. I would like the script to send the remote page's URL, page content, title text, and h1 text, without any HTML tags. Is this possible?

The script I am making is like Google AdSense. Also, I'll be using C# on my backend server.

Will something like this work? http://blog.nparashuram.com/2009/08/screen-scraping-with-javascript-firebug.html

Luke101
  • It sounds like you want to access a page on a different domain, is this the case? – Nick Craver May 29 '10 at 01:26
  • Just wanted to mention that the keywords to use when searching for this topic on Google are "page scrape". As mentioned below, you can't do this cross-domain, so you'll need some server code to pull the page in, parse it, and spit out the data you want in the format of your choice. – Zachary May 29 '10 at 02:25

2 Answers


My suggestion, if it's not too much data, would be to use a beacon.

var beac = new Image();
beac.onload = function () {
  // do something on completion
};
// Use an absolute URL; without a scheme the path resolves against the current page.
beac.src = "http://yourdomain/something.php?var=asdasd&key=someUniqueString";

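This allows you to send a moderate amount of data to a server on another domain, provided you don't need anything back. The data travels in the query string of a GET request, so the amount is bounded by how long a URL the browser and server will accept.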
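
As a rough sketch of how the beacon could carry the fields from the question, the hypothetical helper below builds the query string with encodeURIComponent and gives up if the URL gets too long. The function name buildBeaconUrl, the endpoint http://yourdomain.example/collect, the parameter names, and the 2000-character limit are all assumptions for illustration, not part of the original answer.

```javascript
// Hypothetical helper: turn a set of page fields into a beacon URL.
// Returns null when the URL would exceed maxLength, because cutting an
// encoded URL short could split a %XX escape sequence.
function buildBeaconUrl(endpoint, fields, maxLength) {
  var parts = [];
  for (var key in fields) {
    if (Object.prototype.hasOwnProperty.call(fields, key)) {
      parts.push(encodeURIComponent(key) + "=" + encodeURIComponent(fields[key]));
    }
  }
  var url = endpoint + "?" + parts.join("&");
  return url.length <= maxLength ? url : null;
}

// Browser-only usage, guarded so the helper can also be loaded outside a page.
if (typeof document !== "undefined") {
  var h1 = document.getElementsByTagName("h1")[0];
  var src = buildBeaconUrl("http://yourdomain.example/collect", {
    url: location.href,
    title: document.title,
    h1: h1 ? (h1.textContent || h1.innerText) : ""
  }, 2000); // assumed conservative limit on URL length
  if (src) {
    new Image().src = src; // fire and forget; no response is read
  }
}
```

A full article body will usually not fit under such a limit, which is why only the short fields are sent here.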

Rixius
  • "provided you don't need anything back"...I don't want to seem rude, but did you read the question? – Nick Craver May 29 '10 at 02:14
  • yes I did, he simply said he wants to send the title, h1 and url for the current page back to his domain. – Rixius May 29 '10 at 02:18
  • and article content, but all the transfer mentioned is from the current page to his page. – Rixius May 29 '10 at 02:19
  • @Rixius - In the question..."I want javascript or jquery to get all the article content from the webpage" – Nick Craver May 29 '10 at 02:22
  • "to get all the article content from the webpage and send it back to my website." "webpage"(Browser) to "my website"(server) – Rixius May 29 '10 at 02:33
  • Will something like this work? http://blog.nparashuram.com/2009/08/screen-scraping-with-javascript-firebug.html – Luke101 May 29 '10 at 02:49
  • @Nick re-reading the question I think this does solve the problem. – Rex M May 29 '10 at 02:53
  • +1 he could always create a ` – cryo May 29 '10 at 03:29
  • @Rixius - He wants to get a page from a remote server, get items from that page and send them back to his server with the client as the middle man....I don't see how this answer does that at all...do you plan to fit the article he's talking about in a GET request? – Nick Craver May 29 '10 at 23:08

In short, you can't do this, at least not in the way you were expecting. For security reasons there's a same-origin policy in place that prevents a script from making requests to another domain and reading the response.
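
To make "same origin" concrete: two URLs share an origin when their scheme, host, and port all match. The check below is only an illustration of that comparison, not the browser's actual enforcement code. The function name sameOrigin is hypothetical, and the sketch assumes a runtime that provides the standard URL class.

```javascript
// Hypothetical illustration of the same-origin comparison.
// Assumes the standard URL class is available (modern browsers, Node.js).
function sameOrigin(a, b) {
  var ua = new URL(a);
  var ub = new URL(b);
  // URL normalizes a default port (80 for http, 443 for https) to "".
  return ua.protocol === ub.protocol &&
         ua.hostname === ub.hostname &&
         ua.port === ub.port;
}
```

Under this rule a widget script running on the publisher's page counts as part of that page's origin, so it can read that page's DOM but cannot read a page fetched from some other domain.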

Your best option is to do this on your server and make the request to it. I can't speak to how you'd do this on the server, since your question doesn't include which framework you're on, but let's say it's PHP. You'd have that page take a URL, or something you can generate the URL from, and return a JSON object containing the properties you listed. The jQuery part would look something like this:

$("a").click(function() {
  $.ajax({
    url: 'myPage.php',
    data: { url: $(this).attr("href") },
    dataType: 'json',
    success: function(data) {
      //use the properties, data.url, data.content, data.title, etc...
    }
  });
});

Or, the short form using $.getJSON()...

  $.getJSON('myPage.php', { url: $(this).attr("href") }, function(data) {
      //use the properties, data.url, data.content, data.title, etc...
  });

All the above notwithstanding, you're better off sending the URL to your server and doing this completely server-side; it'll be less work. If you're aiming to view the client's page as they would see it...well, this is exactly what the same-origin policy is in place to prevent, e.g. what if instead of an article it was their online banking? You can see why this is prohibited :)
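
For a sense of what the server-side step involves once the page's HTML has been fetched, here is a deliberately naive sketch that pulls out the title and first h1 and strips the tags from everything else. It is written in JavaScript only to match the other snippets here; the same steps would carry over to whatever the backend uses. The function names stripTags, firstMatch, and summarize are hypothetical. Regular expressions are a rough stand-in for a real HTML parser, entities such as &amp;amp; are not decoded, and picking out just the article body would need site-specific logic that this does not attempt.

```javascript
// Naive sketch: reduce an HTML string to plain text.
function stripTags(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ") // drop script/style bodies
    .replace(/<[^>]+>/g, " ")                                   // drop remaining tags
    .replace(/\s+/g, " ")                                       // collapse whitespace
    .trim();
}

// Return the tag-free text of the first capture group, or "" if no match.
function firstMatch(html, re) {
  var m = re.exec(html);
  return m ? stripTags(m[1]) : "";
}

// Build the object the question asks for: url, title, h1, and content.
function summarize(url, html) {
  return {
    url: url,
    title: firstMatch(html, /<title\b[^>]*>([\s\S]*?)<\/title>/i),
    h1: firstMatch(html, /<h1\b[^>]*>([\s\S]*?)<\/h1>/i),
    content: stripTags(html)
  };
}
```

The returned object has the same shape as the JSON the jQuery examples above expect (data.url, data.content, data.title).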

Nick Craver
  • Will something like this work? http://blog.nparashuram.com/2009/08/screen-scraping-with-javascript-firebug.html – Luke101 May 29 '10 at 02:49
  • @Luke - It would, assuming you're still within the cross-domain bounds (plugins have more leeway here)...but this is still *far* easier to do totally server-side. – Nick Craver May 29 '10 at 10:50