
I want to create a web app which organizes and analyses information from another website. The other website has no API, so I want to just take all the HTML from it (after allowing its scripts to run) and have that available to me for picking apart using jquery for my web app.

I realize that PHP or other server-side language is the true answer to this issue, but I only know front-end stuff and just want to make something quick and dirty. No one is using this but me.

The only way I can think to achieve this right now is by using a hidden iframe. Is there anything (relatively) more elegant than this solution?

Brimby

5 Answers


You can do that easily with a YQL REST call.

See examples here: https://developer.yahoo.com/yql/guide/yql-select-xpath.html

Basically, you just need to make an AJAX call to Yahoo's YQL server; it will return a response containing the HTML of the page you queried.

Playground link - as you can see the REST query is at the bottom of the page.
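As a sketch, the REST URL can be built like this (the endpoint and the `html` table are from Yahoo's documentation of the time; YQL has since been retired, so treat this as historical):

```javascript
// Build the YQL REST URL for scraping a page via XPath.
// (Historical: Yahoo shut down the YQL service in 2019.)
function buildYqlUrl(pageUrl, xpath) {
  var query = 'select * from html where url="' + pageUrl +
              '" and xpath="' + xpath + '"';
  return 'https://query.yahooapis.com/v1/public/yql' +
         '?q=' + encodeURIComponent(query) +
         '&format=json';
}

// Example usage with jQuery (hypothetical target page):
// $.getJSON(buildYqlUrl('http://example.com/', '//body'), function (data) {
//   console.log(data.query.results); // the scraped HTML, as JSON
// });
```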

update -

Google "scraping webpages using phantomjs" and you'll get exactly what you need to scrape and parse pages and get the final result.
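For instance, a minimal PhantomJS script might look like the sketch below (run it with the `phantomjs` binary, not Node; `example.com` is a hypothetical stand-in for the real target):

```javascript
// scrape.js - run with: phantomjs scrape.js
var page = require('webpage').create();
var url = 'http://example.com/'; // hypothetical target page

page.open(url, function (status) {
  if (status !== 'success') {
    console.error('failed to load ' + url);
    phantom.exit(1);
  }
  // page.content holds the DOM *after* the page's scripts have run,
  // which is exactly what a plain HTTP fetch cannot give you.
  console.log(page.content);
  phantom.exit();
});
```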

vsync
    Even if he uses YQL he will only get the source of the page as returned by the server, not with any javascripts executed (as he specifically requested). – Karl-Johan Sjögren Nov 16 '14 at 21:25
  • @Karl-JohanSjögren - so? He can feed it into a headless browser (like phantomJS) and get back the result. That is the easy part; he only didn't know how to actually get the HTML. A quick google shows a lot of good results, such as this - http://nrabinowitz.github.io/pjscrape/ – vsync Nov 16 '14 at 21:32
  • Yes phantomjs or similar technology is probably the best way to go, so why use YQL at all then? A headless browser could go directly to the page (which it should to get all referenced scripts etc.). – Karl-Johan Sjögren Nov 16 '14 at 22:12
  • well I missed the part where he says he wants the javascript to run. I have updated my answer. – vsync Nov 16 '14 at 22:28

In fact, hidden iframes will likely not work, as browsers do not allow JavaScript in one window to access the DOM of another window from a different origin (such as a cross-origin iframe).

Why don't you just get the HTML using jQuery.get()?

Ben
  • jQuery.get() won't work with remote servers (unless that server has some very permissive CORS rules), and it also won't execute the scripts on the webpage, which the OP put as a requirement. – Karl-Johan Sjögren Nov 16 '14 at 19:01
  • Fair enough. I missed the scripts part. How do server side rules know it's a request from script? – Ben Nov 16 '14 at 19:02
  • How about `jQuery.ajax()`? – Wouter Florijn Nov 16 '14 at 19:06
  • 1
    The server doesn't know about it, but modern browsers won't allow the call unless proper headers are returned. This is mainly to stop malicious sites from making requests to external sources (which would include any cookies you have for that domain) and retrieving data it shouldn't access. https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS – Karl-Johan Sjögren Nov 16 '14 at 19:12
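For reference, the opt-in is a response header that only the remote server can send (nothing you can set from the client side); the origin shown is hypothetical:

```
Access-Control-Allow-Origin: http://your-app.example
```

Without that header (or a `*` value), the browser discards the cross-origin response before your script ever sees it.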

You can't access the DOM of a cross-origin iframe, because that violates the browser's same-origin policy. With iframes, both frames need to be served from the same host - or at least from subdomains of the same domain (e.g. foo.com and bar.foo.com), in which case you also have to explicitly set the document.domain property on both pages. It's like this so that a page can't just include an iframe pointing at your bank / Facebook / other sites with sensitive information and simply steal the contents. See MDN for more.

If you really just want to lift the HTML content from a site, then JavaScript isn't an optimal solution, due to cross-origin policies which exist for good reasons.

Ian Clark

Simple answer: NO

Modern browsers won't let you do that; otherwise they'd be insecure.

Details of ways to request a webpage can be found in this question, but all require you to be on the same domain.

My suggestions :

Option A: Take a Sunday off (like today!) and learn some basic server-side stuff. You already know JavaScript; you can learn to build a simple web server with Node.js in just a day!

Option B: If you really don't want to touch back-end stuff, consider building your app as a Chrome app. That way, you can politely ask the user for permission to fetch content from remote locations.
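A sketch of the relevant part of such an app's manifest.json (the names are hypothetical; the URL-pattern permissions are what allowed cross-origin XHR in Chrome packaged apps of that era):

```json
{
  "name": "My Scraper",
  "version": "0.1",
  "manifest_version": 2,
  "app": {
    "background": { "scripts": ["background.js"] }
  },
  "permissions": ["http://*/*", "https://*/*"]
}
```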

Mark Ni

Is your goal to essentially "screen scrape", using JavaScript?
If so, a website will not work (for security reasons), but you might still have an option.

You can create a "bookmarklet" by pasting JavaScript into a bookmark, preceded with javascript:. Then you simply open the webpage, click your bookmark, and your code is executed.

I recommend writing all the code in an actual JavaScript file first, and just pasting it into the bookmark. As an example:

javascript:alert("hello");
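If your script grows beyond one line, a small helper (hypothetical, not part of the answer) can URL-encode it into the one-line javascript: form a bookmark expects:

```javascript
// Turn a multi-line script into a one-line "javascript:" URL
// suitable for pasting into a bookmark.
function toBookmarklet(code) {
  // Wrap in an IIFE so the page isn't replaced by the script's
  // return value, then URL-encode the whole thing.
  return 'javascript:' + encodeURIComponent('(function(){' + code + '})();');
}

// Example: toBookmarklet('alert(document.title);')
```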
Scott Rippey