
I've been looking at how to automate actions on a webpage with PhantomJS, but I'm having trouble manipulating the page to do what I want.

I'm using the random sentence generator at watchout4snakes.com as a test site. I've managed to get Phantom to open the webpage and scrape the random sentence from the #result span. Now I want to get another sentence without re-launching the script. I don't want to close and re-open the page, because Phantom takes ages to launch WebKit and load the page. So I thought I could get another sentence by having Phantom click the 'Refresh' button below the sentence box. Here's what I have at the moment:

var page = require('webpage').create();

console.log("connecting...");   

page.open("http://watchout4snakes.com/wo4snakes/Random/RandomSentence", function(){    
    console.log('connected');
    var content = page.content;
    var phrase = page.evaluate(function() {
        return document.getElementById("result").innerHTML;
    });

    console.log(phrase);
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
            $("frmSentence").click();
        });
    });

    var content = page.content;
    var phrase = page.evaluate(function() {
        return document.getElementById("result").innerHTML;
    });

    console.log(phrase);
    phantom.exit();
});

As you can see, I'm trying to click the Refresh button with .click(), but it isn't working: I still get the same sentence as before. Here is the HTML for the button:

<form action="/wo4snakes/Random/NewRandomSentence" id="frmSentence" method="post" novalidate="novalidate">        
    <p><input type="submit" value="Refresh"></p>
</form>

I'm not sure which element the script should click. I'm trying the form ID 'frmSentence', but that isn't working. Is .click() the right way to go about this, or is there some way for Phantom to submit the form that the button belongs to? Or could I run the page's own script that fetches the sentence? I'm a bit lost on this one, so I don't know which method to go with.

Artjom B.
kmahon99

2 Answers


Web scraping is about sending the required information to a web server and getting the result back. It is not about behaving like a user who clicks buttons or enters search criteria. All you need to do in this example is send a POST request to http://watchout4snakes.com/wo4snakes/Random/NewRandomSentence. The result is just text in page.content, so you don't even need page.evaluate. To get more than one sentence, call page.open in a loop.
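
A minimal sketch of that loop, under some assumptions: the endpoint is assumed to accept a POST with an empty body and to reply with the sentence as plain text, and `fetchSentences` and `SENTENCE_URL` are names made up for this example. Because page.open is asynchronous, the "loop" is written as a chain in which each request starts from the previous request's callback.

```javascript
// Sketch only: assumes the endpoint answers an empty POST with the sentence as text.
var SENTENCE_URL = 'http://watchout4snakes.com/wo4snakes/Random/NewRandomSentence';

// Requests `count` sentences one after another, reusing the same page object.
// `page` is anything that offers PhantomJS's page.open(url, method, data, callback).
function fetchSentences(page, count, onDone) {
    var sentences = [];

    function next() {
        if (sentences.length >= count) {
            onDone(sentences);
            return;
        }
        page.open(SENTENCE_URL, 'post', '', function (status) {
            if (status !== 'success') {
                onDone(sentences); // stop early on a failed request
                return;
            }
            // plainText is the response without markup (assumed to be the sentence).
            sentences.push(page.plainText);
            next();
        });
    }

    next();
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    fetchSentences(require('webpage').create(), 3, function (sentences) {
        console.log(sentences.join('\n'));
        phantom.exit();
    });
}
```

If the form turns out to send fields that are not visible in the snippet from the question, they would go into the third argument of page.open as a URL-encoded string.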

wayne
  • Thanks for the answer. My initial issue with `page.open` is that it takes ages to connect and get the data, and I want to be able to get a new sentence relatively seamlessly. If it's just about sending the POST to the given address, is there a way to do this automatically with Phantom while the page is still open? Or is sending the POST part of the `page.open` command? – kmahon99 Sep 26 '14 at 11:56
  • You can try `page.reload()` if you open `NewRandomSentence` directly, but you may need to test how reliable it is. I scraped 100K+ records with 3 mini steps each time (search, go to the detail page, download the report), and PhantomJS crashed as often as every 5 requests. After I changed to calling `page.open` for every request it was slightly more reliable, crashing every 100 requests or so. Maybe that only happens with https and not http; I'm not sure. – wayne Sep 26 '14 at 12:24

You have a problem with your control flow. page.includeJs is an asynchronous function. Any statements that follow the page.includeJs call are likely to run before the script has loaded and its callback has executed. In your case, that means you read the sentence twice before the click is even triggered.

If you want to do this multiple times, I suggest using recursion, since you cannot write this synchronously. Also, since you want this to be fast, you cannot use a static setTimeout with a timeout of 1 second, because sometimes the request will be faster (you lose time) and sometimes slower (your script breaks). You should use the waitFor function from the PhantomJS examples.
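
Here is a rough sketch of how the recursion and the polling could fit together. The `waitFor` below is a simplified variant written for this answer, not a copy of the one shipped in the PhantomJS examples, and `collectSentences` is a hypothetical helper name. It also assumes that clicking Refresh updates #result in place (via the page's own script) rather than navigating away.

```javascript
// Polls testFx every intervalMs until it returns true, then calls onReady.
// Calls onTimeout instead if timeoutMs elapses first.
function waitFor(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFx()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Reads a sentence, clicks Refresh, waits until the text changes, then repeats.
// Each step schedules the next one from a callback (asynchronous recursion).
function collectSentences(readSentence, clickRefresh, count, onDone) {
    var results = [];

    function step() {
        var current = readSentence();
        results.push(current);
        if (results.length >= count) {
            onDone(results);
            return;
        }
        clickRefresh();
        waitFor(
            function () { return readSentence() !== current; },
            step,
            function () { onDone(results); }, // give up, keep what we have
            5000,
            100
        );
    }

    step();
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://watchout4snakes.com/wo4snakes/Random/RandomSentence', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        collectSentences(
            function () {
                return page.evaluate(function () {
                    return document.getElementById('result').innerHTML;
                });
            },
            function () {
                page.evaluate(function () {
                    // Assumption: the submit button is the thing to click, not the form.
                    document.querySelector('#frmSentence input[type=submit]').click();
                });
            },
            3,
            function (sentences) {
                console.log(sentences.join('\n'));
                phantom.exit();
            }
        );
    });
}
```

The page is opened only once, and phantom.exit() is called only from the final callback, so nothing exits before the last sentence has been read.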

Instead of loading jQuery every time, you can move page.includeJs up and put everything else in its callback. If you only need to click an element, or if jQuery's click doesn't work (yes, that happens from time to time), you should use the approach from "PhantomJS; click an element".
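
For reference, clicking without jQuery can look roughly like the sketch below, which dispatches a synthetic mouse event through the standard DOM API. `clickElement` is a name chosen for this example, and the selector for the submit button is an assumption based on the HTML in the question. Note also that `$("frmSentence")` in the question matches nothing: an id selector needs a leading `#`, and the click should target the button rather than the form.

```javascript
// Runs inside the page context: finds the element and dispatches a click event.
// Returns false when the selector matches nothing, true otherwise.
function clickElement(selector) {
    var el = document.querySelector(selector);
    if (!el) {
        return false;
    }
    var ev = document.createEvent('MouseEvent');
    ev.initMouseEvent(
        'click',
        true, true,                 // bubbles, cancelable
        document.defaultView,       // view
        0,                          // detail
        0, 0, 0, 0,                 // screenX, screenY, clientX, clientY
        false, false, false, false, // ctrl, alt, shift, meta
        0,                          // button (0 = left)
        null                        // relatedTarget
    );
    el.dispatchEvent(ev);
    return true;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://watchout4snakes.com/wo4snakes/Random/RandomSentence', function (status) {
        if (status === 'success') {
            // page.evaluate serializes the function, so the selector is passed as an argument.
            var clicked = page.evaluate(clickElement, '#frmSentence input[type=submit]');
            console.log('clicked: ' + clicked);
        }
        phantom.exit();
    });
}
```

The click only triggers the request. Reading the new sentence still has to wait until #result has actually changed.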

Community
Artjom B.