
I need to extract an email address from a web page. The page contains a link to the email address. When I click the link, it sends an XHR request. The Ajax response is handled by a JavaScript function that parses it and opens a mail client.

Because the Ajax response doesn't change the HTML in any way, I can't extract the email address by monitoring the HTML.

I need to capture the Ajax response myself so that I can parse it and save it in a database.

from selenium import webdriver

#
# Initialize browser etc.
#
driver = webdriver.PhantomJS()
emailLink = driver.find_element_by_class_name('email_add')
emailLink.click()

# The HTML does not change after the click, so I can't find the email address in it.

By using the Firefox webdriver in place of PhantomJS, I confirmed that the code works: Firefox opens a mail client in response to the Ajax reply.

I tried issuing the request using requests and urllib2, but somehow the webserver identifies these manually generated requests and redirects to the home page.

Faisal

2 Answers


I took the XHR intercept code from this gist (https://gist.github.com/suprememoocow/2823600) and wrapped it in a PhantomJS script that injects it into the page I was scraping. Note that the page has to be loaded before the XHR intercept is injected. I also had to tell PhantomJS to capture the messages the page writes to console.log and print them, which is what the page.onConsoleMessage handler does.

I used the technique of stepping through an array of functions from Vijay's accepted answer here.

For a more interesting live data feed, try using http://flightaware.com/live/ instead of maps.google.com below. Be patient, though: it may take anywhere from one to five minutes to get an update.

Here's the partial PhantomJS script (sorry, untested beyond checking that it parses):

  var page = new WebPage(), testindex = 0, loadInProgress = false;

  page.onLoadStarted = function() {
    loadInProgress = true;
    console.log("load started");
  };

  page.onLoadFinished = function() {
    loadInProgress = false;
    console.log("load finished");
  };

  page.onConsoleMessage = function(msg) {
    console.log(msg);
  };

  var steps = [
  function() {
    // Load the page to scrape
    page.open("http://maps.google.com");
  },    
  function() {

    page.render('check.png');  // see what's happened.
    page.evaluate(
     function( x) {
    //inject following code from https://gist.github.com/suprememoocow/2823600
    // I've added console.log() calls along with onConsoleMessage above to see XHR responses.
    (function(XHR) {
        "use strict";

        var stats = [];
        var timeoutId = null;

        var open = XHR.prototype.open;
        var send = XHR.prototype.send;

        XHR.prototype.open = function(method, url, async, user, pass) {
            this._url = url;
            open.call(this, method, url, async, user, pass);
        };

        XHR.prototype.send = function(data) {
            var self = this;
            var start;
            var oldOnReadyStateChange;
            var url = this._url;

            function onReadyStateChange() {
                if(self.readyState == 4 /* complete */) {
                    var time = new Date() - start;                
                    stats.push({
                        url: url,
                        duration: time                    
                    });

                    console.log("Request:" + data);
                    // Use self rather than this so the response is always read from the intercepted XHR
                    console.log("Response:" + self.responseText);

                    if(!timeoutId) {
                        timeoutId = window.setTimeout(function() {
                            var xhr = new XHR();
                            xhr.noIntercept = true;
                            xhr.open("POST", "/clientAjaxStats", true);
                            xhr.setRequestHeader("Content-type","application/json");
                            xhr.send(JSON.stringify({ stats: stats } ));                        

                            timeoutId = null;
                            stats = []; 
                        }, 2000);
                    }                
                }

                if(oldOnReadyStateChange) {
                    // Keep the XHR as this and pass the event on to the page's own handler
                    oldOnReadyStateChange.apply(self, arguments);
                }
            }

            if(!this.noIntercept) {
                start = new Date();

                if(this.addEventListener) {
                    this.addEventListener("readystatechange", onReadyStateChange, false);
                } else {
                    oldOnReadyStateChange = this.onreadystatechange; 
                    this.onreadystatechange = onReadyStateChange;
                }
            }

            send.call(this, data);
        }
    })(XMLHttpRequest);


     },""
    );
    }, 
    function() {
        // try something else here.  Add more steps as necessary
    }
];

var interval = setInterval(function() {
  if (!loadInProgress && typeof steps[testindex] == "function") {
    console.log("step " + (testindex + 1));
    steps[testindex]();
    testindex++;
  }
  if (typeof steps[testindex] != "function") {
     // commented out to run until ctrl-c
    //console.log("test complete!");
    //phantom.exit();
  }
}, 500);
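
Once the script above prints the intercepted traffic, the remaining part of the question (parse the response and save it in a database) could be handled on the Python side. The following is only a minimal sketch under several assumptions: the PhantomJS console output has already been captured as a list of text lines, the email address appears literally somewhere in a "Response:" line, and an SQLite table named emails (a hypothetical name) is an acceptable place to store it. The regular expression is deliberately loose and is not a complete email validator.

```python
import re
import sqlite3

# Prefix printed by the console.log call added to the intercept code.
RESPONSE_PREFIX = "Response:"

# Deliberately loose pattern; enough for a sketch, not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(console_lines):
    """Return unique email addresses found in 'Response:' lines, in order seen."""
    found = []
    for line in console_lines:
        if not line.startswith(RESPONSE_PREFIX):
            continue  # skip "step N", "load finished", "Request:" and other lines
        body = line[len(RESPONSE_PREFIX):]
        for address in EMAIL_RE.findall(body):
            if address not in found:
                found.append(address)
    return found


def save_emails(conn, addresses):
    """Store addresses in a hypothetical 'emails' table, ignoring duplicates."""
    conn.execute("CREATE TABLE IF NOT EXISTS emails (address TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO emails (address) VALUES (?)",
        [(address,) for address in addresses],
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up sample of what the script's output might look like.
    sample_output = [
        "step 1",
        "load finished",
        "Request:id=42",
        'Response:{"href": "mailto:someone@example.com"}',
    ]
    conn = sqlite3.connect(":memory:")
    save_emails(conn, extract_emails(sample_output))
    print(conn.execute("SELECT address FROM emails").fetchall())
```

If the real response turns out to be JSON or HTML with a known structure, parsing that structure directly would likely be more reliable than a regular expression.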
JJones

I tried issuing the request using requests and urllib2, but somehow the webserver identifies these manually generated requests and redirects to the home page.

If this is the problem, make the server think the request is coming from a browser by changing the User-Agent header:

Changing user agent on urllib2.urlopen
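
As a rough illustration of that suggestion, the sketch below builds a request with browser-style headers. It uses Python 3's urllib.request rather than urllib2, the User-Agent string is just an example value, and the URL in the usage section is hypothetical. The X-Requested-With header is an extra assumption: jQuery adds it to its Ajax calls, but whether this particular server checks it is unknown.

```python
import urllib.request

# Example browser-style User-Agent string; any current browser's value would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_browser_like_request(url):
    """Build (but do not send) a request carrying browser-style headers."""
    headers = {
        "User-Agent": BROWSER_USER_AGENT,
        # jQuery sends this header with its Ajax calls; whether the target
        # server inspects it is an assumption.
        "X-Requested-With": "XMLHttpRequest",
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Hypothetical endpoint, used only to show the call shape.
    request = build_browser_like_request("http://example.com/ajax/email?id=1")
    print(request.header_items())
    # Sending it would be: urllib.request.urlopen(request).read()
```

If the server still redirects to the home page, it may also be checking cookies or the Referer header, in which case copying those from a real browser session would be the next thing to try.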

Shamik