
Given a web page with JavaScript code, I would like to generate the resulting HTML automatically (either via a CLI tool or using a library in any language).

For example, given test.html:

<!DOCTYPE html>
<html>
  <body>
    <p id="demo"></p>
    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script>
  </body>
</html>

I would like to get this as a result:

<html>
  <body>
    <p id="demo">Hello JavaScript!</p>
    <script>
      document.getElementById("demo").innerHTML = "Hello JavaScript!";
    </script> 
  </body>
</html>
Timofey
  • Could you serve the page on your localhost then scrape it with phantom? – Daniel Lizik Nov 19 '15 at 17:17
  • I think @Tim is trying to parse an HTML file and insert text in the <p> tag. BeautifulSoup should do the work. – Yu Wu Nov 19 '15 at 17:22
  • How about using a WebClient and getting the resulting dom from it? I imagine you could do this but haven't tested it, hence the comment instead of an answer. – Darren Gourley Nov 19 '15 at 17:25
  • @Daniel_L The page is served from the host I have no control – Timofey Nov 19 '15 at 18:13
  • @DarrenGourley are you talking about htmlunit http://htmlunit.sourceforge.net/? If so, then unfortunately it breaks while processing my page – Timofey Nov 19 '15 at 18:14
  • @YuWu not exactly. The JavaScript might invoke some external url and modify the current page's DOM. A given example is simple and serves for demo purposes only. – Timofey Nov 19 '15 at 18:18
  • @AdamBuchananSmith thanks, will check their APIs http://doc.jsfiddle.net/api/ – Timofey Nov 19 '15 at 18:19
  • Sorry @Tim, I may have gotten the wrong end of the stick with your question, for some reason I assumed it had a .NET tag. WebClient is baked in to .NET and allows you to request Web pages as you would in a browser, only programmatically. – Darren Gourley Nov 19 '15 at 18:22
  • You're misunderstanding @Daniel_L's comment, which is actually a correct answer (almost). Just run a phantom program which loads the page (from the host), then grab the full page content. –  Nov 19 '15 at 19:13
  • @torazaburo the comment was modified. I will definitely try http://phantomjs.org/screen-capture.html thanks – Timofey Nov 19 '15 at 19:17
  • Screen capture is close to what you want, but not exactly. You'll want to grab the entire page HTML and save it to file somewhere most likely. –  Nov 19 '15 at 19:22
  • @torazaburo do you have perhaps an example? If you feel you have right solution, you can write an answer and get it accepted :) – Timofey Nov 19 '15 at 19:24

2 Answers


After a quick search, it looks like WatiN will do what you want.

It's aimed at automated testing, but when it loads a page it executes all of the page's JavaScript, including AJAX calls. It looks like you can grab the resulting HTML from it too.

Darren Gourley
  • Thanks, I will check this soon. Are you familiar with some other library for some open source stack (java, scala, clojure)? – Timofey Nov 19 '15 at 18:16
  • Sorry @Tim, as per my comment above I thought this question had a .NET tag. I'm not aware of any other libraries that would do this. – Darren Gourley Nov 19 '15 at 18:24

This answer is based on @torazaburo's comment.

In fact, PhantomJS can evaluate the page's JavaScript and produce the resulting HTML.

Here is what it could look like. Save the script below as load_page.js and run phantomjs load_page.js path_to/test.html

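var page = require('webpage').create(),
    system = require('system'),
    fs = require('fs'),
    page_address;

// system.args[0] is the script name; the page to load is the first real argument.
if (system.args.length === 1) {
  console.log('Usage: phantomjs ' + system.args[0] + ' <page_to_load:http://www.google.com>');
  phantom.exit(1);
} else {
  // phantom.exit() does not stop the script immediately, so the load lives in the else branch.
  page_address = system.args[1];

  page.open(page_address, function (status) {
    console.log('Status: ' + status);
    if (status === 'success') {
      // page.content is the current DOM serialized as HTML,
      // including changes made by scripts that have already run.
      fs.write('phantom_result.html', page.content, 'w');
    }
    phantom.exit();
  });
}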
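
As noted in the comments, the real page's JavaScript might call an external URL and modify the DOM later, in which case the content may be captured before those changes land. One possible way around that is to poll until the page looks ready and only then write the file. The following is a minimal sketch, not a tested solution for any particular site: waitFor is a hypothetical helper name, the 5000 ms timeout is arbitrary, and the readiness check (the #demo element from test.html being non-empty) is an assumption you would replace with something that fits your page.

```javascript
// Hypothetical helper: poll `condition` every `intervalMs` milliseconds until it
// returns a truthy value or `timeoutMs` elapses, then call done(true) or done(false).
function waitFor(condition, done, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    var ok = !!condition();
    if (ok || Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      done(ok);
    }
  }, intervalMs || 100);
}

// The PhantomJS-specific part runs only where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var system = require('system');
  var fs = require('fs');

  page.open(system.args[1], function (status) {
    if (status !== 'success') {
      console.log('Failed to load ' + system.args[1]);
      phantom.exit(1);
      return;
    }
    waitFor(function () {
      // Assumed readiness check: #demo (from test.html) has been filled in.
      return page.evaluate(function () {
        var el = document.getElementById('demo');
        return !!el && el.innerHTML.length > 0;
      });
    }, function (ready) {
      if (!ready) {
        console.log('Timed out; saving the page as it is.');
      }
      fs.write('phantom_result.html', page.content, 'w');
      phantom.exit(ready ? 0 : 1);
    }, 5000);
  });
}
```

It is run the same way as the script above. A fixed delay with setTimeout would be simpler, but polling finishes as soon as the condition holds and still gives up after the timeout.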
Timofey