
I have already searched the Internet for how to "create" a simple headless browser, because I'm interested in how a browser works internally. I'd like to implement a simple headless browser.

What I mean is: suppose you have an HTML string and a JavaScript string, both obtained from HTTP requests to a server. How can I apply the JavaScript to the HTML string?

For example: I requested the HTML source file from server X and got this in the response:

<html>
    <head>
         <script type="text/javascript" src="javascript.js"></script>
    </head>
    <body>
        <p id="content"></p>
    </body>
</html>

Then I request the javascript.js file and get this:

document.getElementById("content").textContent = "Hello";

How can I apply the contents of javascript.js to the HTML file? Are the steps I should follow something like this?

  1. Parse the HTML source into DOM elements that JavaScript can work with
  2. Apply the JavaScript to the DOM

I'd like to do it with Java, Scala or Node.js. I don't know if the main idea is clear... I'm Latin American, and my English isn't very good. Sorry for that. If you don't understand, please let me know in the comments and I'll edit my post.

EDIT: what I would like to do, in other words, is like a pseudo method/function like this (in pseudocode):

function applu(html, js){
    // Apply js into html
}
Malvrok
  • JavaScript gets **included** in the page. As long as you reference the JavaScript file, it will execute. In your example above, `content` will become `Hello`. Depending on how you're loading the file, you may need to move the script inclusion **below** the `<p>` in your HTML DOM, or use something like jQuery's `$(document).ready()` to ensure that the element actually **exists** before you attempt to modify it :)

    – Obsidian Age Feb 19 '17 at 22:01
  • But how do I pass the HTML and the JS file to, for example, the Nashorn JavaScript engine in Java 8 and run it to update the HTML content? I can choose between Java 8 and Node.js, but I don't know how to do this. If someone could give me a brief and simple example, I'd really appreciate it. – Malvrok Feb 19 '17 at 22:12
  • Possible duplicate of [How can I run JavaScript code at server side Java code?](http://stackoverflow.com/questions/1999503/how-can-i-run-javascript-code-at-server-side-java-code) – Jamie Birch Feb 19 '17 at 23:41
  • Are you looking to write your own browser to learn, or looking to execute the JavaScript on the server against the HTML for some project (and you'd like to avoid writing your own DOM engine)? – Sean Vieira Feb 20 '17 at 00:02
  • What I'm trying to do is create my own "headless browser", because PhantomJS, CasperJS and others lack many of the features I need (cancelling certain HTTP requests, etc.). With this headless browser, I want to obtain certain info so I can scrape without problems. By the way, I have already used PhantomJS and CasperJS for some months (6, more or less). – Malvrok Feb 20 '17 at 18:59

1 Answer


If you're looking for a headless browser, I'm sure you're aware of PhantomJS. PhantomJS is a headless browser based on Apple's WebKit browser engine.

You're asking for a lot here. You need:

  1. a JavaScript runtime (such as V8) to run the JavaScript.
  2. a web engine to bring the HTML, and the Document Object Model it defines, to life.

Both of those things take millions of lines of code to implement.

My recommendation is to integrate your program with PhantomJS, which is both a headless web browser and a JavaScript environment. If you're using Scala, start PhantomJS as a child process and exchange messages with it over standard I/O. Because PhantomJS is driven through its JavaScript API, you would also have to write a JS script to handle the messages coming in over standard I/O. It isn't well documented, but PhantomJS exposes `system.stdin` and `system.stdout` for this.

That's a lot of work, and it needs a lot of extra resources outside the JVM to get working. Since you're using Scala, you could go with a simpler solution and use jsoup to parse and modify the HTML document; however, you would have to write the transformations in Scala (or Java), since jsoup doesn't run JavaScript.

Actually, now that I think about it, you should use jsdom paired with Node.js. jsdom implements the DOM API without actually rendering anything, which might be what you need. jsdom is made for Node.js, which is headless. You can also use Node's standard I/O to send messages to and from the JVM if you want to use both Scala and Node.
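To make the standard I/O idea a bit more concrete, here is a minimal sketch of the Node side, assuming one JSON message per line. The `{id, html, js}` message shape and the `parseRequest`, `formatResponse`, `serve` and `applyFn` names are all made up for illustration; nothing in jsdom or the JVM prescribes them.

```javascript
const readline = require('readline');

// Parse one input line into a request. The {id, html, js} shape is an
// assumed message format chosen for this sketch.
function parseRequest(line) {
    const { id, html, js } = JSON.parse(line);
    if (typeof html !== 'string' || typeof js !== 'string') {
        throw new TypeError('request needs string "html" and "js" fields');
    }
    return { id, html, js };
}

// Serialize a reply as a single line so the other side can split on "\n"
// (JSON.stringify escapes any newlines inside the HTML).
function formatResponse(id, error, html) {
    return JSON.stringify(error ? { id, error: String(error) } : { id, html });
}

// Read requests from `input`, run each through `applyFn(html, js)`
// (hypothetical: returns a Promise of the resulting HTML), and write
// one reply line per request to `output`.
function serve(input, output, applyFn) {
    const lines = readline.createInterface({ input });
    lines.on('line', line => {
        let request;
        try {
            request = parseRequest(line);
        } catch (e) {
            output.write(formatResponse(null, e) + '\n');
            return;
        }
        applyFn(request.html, request.js)
            .then(html => formatResponse(request.id, null, html))
            .catch(e => formatResponse(request.id, e))
            .then(reply => output.write(reply + '\n'));
    });
    return lines;
}
```

Calling `serve(process.stdin, process.stdout, applyFn)` with any function that takes the HTML and script strings and returns a Promise would start the loop; the JVM side would then write one JSON line per request to the child's stdin and read one line per reply.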


Here is a proof of concept that uses jsdom to evaluate the JavaScript and modify the HTML. It's a really simple solution, and probably the most resource-efficient one for the given task (and this is a hard task).

I made a gist with a very simple proof of concept. To run it:

git clone https://gist.github.com/c8aef41ee27e5304e94f6a255b048f87.git apply-js-to-html
cd apply-js-to-html
npm install
node example.js

This is the meat of the example:

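const jsdom = require('jsdom');

// Note: jsdom.env is the callback API of older jsdom releases;
// newer releases expose a JSDOM class instead.
module.exports = function (html, js) {
    return new Promise((resolve, reject) => {
        // Build a window/document from the HTML string (no rendering).
        jsdom.env(html, (error, window) => {
            if (error) {
                // Stop here so we never eval against a missing window.
                return reject(error);
            }
            try {
                (function evalInContext() {
                    'use strict';
                    // Expose document and window as locals for the evaled code.
                    const document = this.document;
                    const window = this.window;
                    eval(js);
                    resolve(window.document.documentElement.innerHTML);
                }).call(window);
            } catch (e) {
                reject(e);
            }
        });
    });
};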

And here is the module in use

const applu = require('./index');

const html = `
    <html>
        <head></head>
        <body>
            <p id="content"></p>
        <body>
    </html>
`;

const js = `document.getElementById("content").innerHTML = "Hello";`;

applu(html, js).then(result => {
    console.log('input html: ', html);
    console.log('output html: ', result);
}).catch(err => console.error(err));

And here is the output of the code:

input html:  
    <html>
        <head></head>
        <body>
            <p id="content"></p>
        <body>
    </html>

output html:  <head></head>
        <body>
            <p id="content">Hello</p>


</body>

jsdom creates a headless window and document environment that doesn't render anything. You can use eval and call it in context using window as the this value. I've also declared document and window again inside the function so that the JS being evaled has those variables in scope.
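As a variation on the same idea, here is a minimal sketch that puts `window` and `document` in scope with Node's built-in `vm` module instead of `eval`. To stay self-contained it uses a hypothetical stub document (`makeStubDocument`) rather than jsdom, so it only illustrates the scoping, not real DOM behaviour. Note that `vm` is not a security sandbox.

```javascript
const vm = require('vm');

// Hypothetical stand-in for a real DOM: just enough of `document`
// to support getElementById on a fixed set of element ids.
function makeStubDocument(ids) {
    const elements = new Map(ids.map(id => [id, { id, innerHTML: '' }]));
    return {
        getElementById: id => elements.get(id) || null,
    };
}

// Run `js` in a fresh context whose globals are `window` and `document`.
function runInDocument(js, document) {
    const window = { document };
    const sandbox = { window, document };
    // The timeout guards against scripts that never return.
    vm.runInNewContext(js, sandbox, { timeout: 1000 });
    return document;
}

const doc = runInDocument(
    'document.getElementById("content").innerHTML = "Hello";',
    makeStubDocument(['content'])
);
console.log(doc.getElementById('content').innerHTML); // prints "Hello"
```

With jsdom, the window and document it creates would take the place of the stub objects here.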

This is just a basic POC; you'll have to iron out the details yourself.

Rico Kahler
  • Thank you for your answer, Rico!! I followed and tried what you suggested. The problem I found is that it is very hard to get even small results (imagine if I wanted to handle thousands of requests in a short period of time xD). I guess I'll look for another solution. – Malvrok Feb 20 '17 at 19:08
  • You could try using Node's [`cluster`](https://nodejs.org/api/cluster.html) API to spin up worker processes and spread the work across CPU cores. That said, this is probably the least resource-intensive of the comprehensive solutions. – Rico Kahler Feb 20 '17 at 19:22
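
A rough sketch of what the `cluster` suggestion could look like, assuming each job is a plain object and `handleJob` is a hypothetical function that returns a Promise (for example, one that applies a script to an HTML string). The `workerCount` and `start` names and the round-robin distribution are illustrative choices, not something the `cluster` API prescribes.

```javascript
const cluster = require('cluster');
const os = require('os');

// `isPrimary` is the current name; older Node versions only have `isMaster`.
const isPrimary = cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;

// Pick how many workers to fork: at least one, at most one per CPU core.
function workerCount(requested, cores = os.cpus().length) {
    return Math.max(1, Math.min(requested, cores));
}

// In the primary process: fork workers and hand out jobs round-robin.
// In a worker process: run handleJob(job) for each job message and
// send the result (or the error) back to the primary.
function start(jobs, requestedWorkers, handleJob) {
    if (isPrimary) {
        const n = workerCount(requestedWorkers);
        const workers = Array.from({ length: n }, () => cluster.fork());
        let pending = jobs.length;
        workers.forEach(w => w.on('message', reply => {
            console.log(reply);
            // Shut the workers down once every job has been answered.
            if (--pending === 0) workers.forEach(x => x.kill());
        }));
        jobs.forEach((job, i) => workers[i % n].send(job));
    } else {
        process.on('message', job => {
            handleJob(job)
                .then(result => process.send({ job, result }))
                .catch(e => process.send({ job, error: String(e) }));
        });
    }
}
```

The same script runs in both the primary and the workers, so `start(jobs, 4, handleJob)` would be called at the top level of the entry file. Each worker is a separate process with its own memory, so this spreads CPU load but does not reduce the per-document cost.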