
I am filling out and submitting a form using PhantomJS and then outputting the resulting page. The problem is that I can't tell whether the form is being submitted at all.

I print the resulting page, but it's the same as the original page. I don't know whether that is because the site redirects back, because the form was never submitted, or because I need to wait longer. In a real browser, the form sends a GET and receives a cookie, which the browser uses to send more GETs before eventually receiving the final result: flight data.

I copied the example from "How to submit a form using PhantomJS", using a different URL and different page.evaluate functions.

var page = new WebPage(), testindex = 0, loadInProgress = false;

page.onConsoleMessage = function(msg) {
  console.log(msg);
};

page.onLoadStarted = function() {
  loadInProgress = true;
  console.log("load started");
};

page.onLoadFinished = function() {
  loadInProgress = false;
  console.log("load finished");
};

var steps = [
  function() {
    // Load the KLM home page
    page.open("http://www.klm.com/travel/dk_da/index.htm");
  },
  function() {
    // Fill in the flight search form
    page.evaluate(function() {
      $("#ebt-origin-place").val("CPH");
      $("#ebt-destination-place").val("CDG");
      $("#ebt-departure-date").val("1/5/2013");
      $("#ebt-return-date").val("10/5/2013");
    });
  },
  function() {
    // Submit the search form
    page.evaluate(function() {
      $('#ebt-flightsearch-submit').click();

      // also tried:
      // $('#ebt-flight-searchform').submit();
    });
  },
  function() {
    // Output content of page to stdout after form has been submitted
    page.evaluate(function() {
      console.log(document.querySelectorAll('html')[0].outerHTML);
    });
  }
];


var interval = setInterval(function() {
  if (!loadInProgress && typeof steps[testindex] == "function") {
    console.log("step " + (testindex + 1));
    steps[testindex]();
    testindex++;
  }
  if (typeof steps[testindex] != "function") {
    console.log("test complete!");
    phantom.exit();
  }
}, 50);
user984003
  • You might want to try CasperJS – it works with Phantom to make it a little more friendly. – Rich Bradshaw Mar 27 '13 at 12:05
  • I guess the thing is that I am not sure that anything will ever work with this page. Like they are actively thwarting scraping attempts. PhantomJs is the fourth thing that I am trying. – user984003 Mar 27 '13 at 12:08
  • Use Casper, pause for around 400ms between actions, change the User Agent to something anonymous e.g. 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10' (that's on Webkit like Phantom, but is the same as any Safari user on OSX 10.8.3), would be surprised if that didn't work. – Rich Bradshaw Mar 27 '13 at 13:14
  • The thing that confuses me (or one thing that confuses me) is the waiting between actions. Do I need an action for every page/ajax call that is loaded or is it like a real browser where I submit the form and it does everything else? For example, the site displays a "waiting" page before displaying the actual data. – user984003 Mar 27 '13 at 13:22
  • I'm just waiting because that's something I might use to check if it's a bot. What you describe is needed though - casper has a nice thing called waitForSelector: http://casperjs.org/api.html#casper.waitForSelector it lets you only continue when a selector is matched, so that should fix this for you. – Rich Bradshaw Mar 27 '13 at 14:56
  • Turns out that Casper doesn't work for my needs. You can only fill out a form using a name selector. I need to be able to do it using an ID selector. – user984003 Mar 31 '13 at 13:45
  • You can still do that, though you might need to use code like the code you are using with Phantom. – Rich Bradshaw Mar 31 '13 at 16:13

1 Answer


The site of interest is rather complicated to scrape. I logged the HTTP traffic from the US KLM site and got this:

GET /travel/us_en/apps/ebt/ebt_home.htm?name=on&ebt-origin-place=New+York+-+John+F.+Kennedy+International+%28JFK%29%2CNew+York&ebt-destination-place=Paris+-+Charles+De+Gaulle+Airport+%28CDG%29%2C+France&c%5B0%5D.os=JFK&c%5B0%5D.ost=airport&c%5B0%5D.ds=CDG&c%5B0%5D.dst=airport&c%5B1%5D.os=CDG&c%5B1%5D.ost=airport&c%5B1%5D.ds=JFK&inboundDestinationLocationType=airport&redirect=no&chdQty=0&infQty=0&c%5B0%5D.dd=2013-07-31&c%5B1%5D.dd=2013-08-14&c%5B1%5D.format=dd%2Fmm%2Fyyyy&flex=true&ebt-cabin-class=ECONOMY&adtQty=1&goToPage=&cffcc=ECONOMY&sc=false HTTP/1.1

Your injected values for the form elements are not what their server is looking for.
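As a rough illustration, the parameter names in the logged request can be assembled into a search URL directly. This is only a sketch: `buildSearchUrl` is a hypothetical helper, the host `www.klm.com` is assumed from the URL in the question, and the subset of parameters kept here is a guess at what matters, not something confirmed against the server. The cookie exchange described in the question may still be needed before the server returns results.

```javascript
// Hypothetical helper: builds a search URL using parameter names copied
// from the logged request above. Which of them the server actually
// requires is an assumption, not something verified here.
function buildSearchUrl(origin, destination, departDate, returnDate) {
  // Host assumed from the question; the log only shows the path.
  var base = "http://www.klm.com/travel/us_en/apps/ebt/ebt_home.htm";
  var params = [
    ["c[0].os", origin], ["c[0].ost", "airport"],
    ["c[0].ds", destination], ["c[0].dst", "airport"],
    ["c[1].os", destination], ["c[1].ost", "airport"],
    ["c[1].ds", origin],
    ["c[0].dd", departDate],  // yyyy-mm-dd, as in the log
    ["c[1].dd", returnDate],
    ["ebt-cabin-class", "ECONOMY"],
    ["adtQty", "1"], ["chdQty", "0"], ["infQty", "0"]
  ];
  // Percent-encode both names and values, so "c[0].os" becomes "c%5B0%5D.os".
  var query = params.map(function (pair) {
    return encodeURIComponent(pair[0]) + "=" + encodeURIComponent(pair[1]);
  }).join("&");
  return base + "?" + query;
}

var url = buildSearchUrl("JFK", "CDG", "2013-07-31", "2013-08-14");
console.log(url);
```

Note that the server-side names are airport codes and ISO-style dates, while the question's script types "CPH" and "1/5/2013" into the visible text fields.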

Inside page.evaluate() you are sandboxed, but the sample code includes a hook (page.onConsoleMessage) that relays console output from the sandbox to the external console. For other debugging you can also use object inspectors and the like, but they have to be injected into the page or be part of the code passed to evaluate().
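Because the site shows an intermediate "waiting" page, it may also help to poll for a condition instead of advancing on a fixed 50 ms tick. The following is a minimal sketch of that idea in plain JavaScript. `waitFor` is a hypothetical helper, not part of PhantomJS, and the `#results` selector in the comment is made up for illustration.

```javascript
// Hypothetical helper (not part of PhantomJS): polls `condition` until it
// returns true or `timeoutMs` elapses, then calls the matching callback.
function waitFor(condition, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (condition()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs || 100);
}

// Stand-in condition so the sketch runs anywhere. In a PhantomJS script the
// condition would instead be something like:
//   function () {
//     return page.evaluate(function () {
//       // "#results" is a made-up selector
//       return document.querySelector("#results") !== null;
//     });
//   }
var polls = 0;
waitFor(
  function () { polls += 1; return polls >= 3; },
  function () { console.log("ready after " + polls + " polls"); },
  function () { console.log("timed out"); },
  2000,
  50
);
```

In the question's script, the final step could be wrapped in such a call so that the page is printed only once the results have appeared, with phantom.exit() called from the callbacks rather than from the interval.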

Lester Buck