
I'm trying to scrape some HTML (with permission from the author). I was using the PHP library suggested here, and it was working well until I encountered a link that looks like this:

<a href="javascript:__doPostBack('dgItem$_ctl2$_ctl0','')">

I believe this is an ASP.NET thing. Clicking it doesn't change the URL; it just loads some new content into the page, which I'd also like to scrape.

How can I get around this?

I suppose I would need to simulate the click, but I can't do that when processing raw HTML. I'd need some kind of browser or JS interpreter, no?

Is there a better-suited library for this task? I'm not limited to PHP, but it's preferred.

mpen
  • Reading [this article](http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/) now... – mpen Jun 26 '12 at 22:40
  • You might be interested in [this project](http://scraperblog.blogspot.com/2012/11/introducing-pgbrowser.html) – pguardiario Nov 05 '12 at 07:47
  • @pguardiario: It says it does forms and cookies, but there's no mention of JS. – mpen Nov 05 '12 at 08:03
  • Take another look, it specifically does the doPostBack actions that you're talking about. – pguardiario Nov 05 '12 at 08:26
  • @pguardiario: Ah... While that may have worked for this project, wouldn't it still be less versatile than something with full JS support? Probably a lot quicker, though. If this comes up again, I'll look into that. Thanks! – mpen Nov 05 '12 at 08:30
  • Yes, but doPostBack actions don't require full JS support. I consider Selenium to be overkill. – pguardiario Nov 05 '12 at 09:00

2 Answers


__doPostBack() is indeed an ASP.NET thing. Here's what the function does:

```javascript
var theForm = document.forms['FORMNAME'];
if (!theForm) {
    theForm = document.FORMNAME;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
```

Basically, it sets the values of two hidden fields (__EVENTTARGET and __EVENTARGUMENT) to the respective values of the parameters. Then it submits the form.
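Since the link's href carries both parameters, a scraper can read them straight out of the attribute without running any JavaScript. Here is a minimal Python sketch of that idea; the `parse_postback` helper and its regular expression are my own illustration and assume both arguments are single-quoted string literals, as in the link from the question.

```python
import re

# Matches __doPostBack('target','argument') where both arguments are
# single-quoted literals (an assumption based on the link in the question).
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback link."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


href = "javascript:__doPostBack('dgItem$_ctl2$_ctl0','')"
print(parse_postback(href))  # ('dgItem$_ctl2$_ctl0', '')
```

The first value is what ends up in __EVENTTARGET and the second in __EVENTARGUMENT.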

If you wanted to, you could continue using the PHP HTML parser to do the job, but when you encounter one of these __doPostBack() links, you'd have to craft a POST request manually. At a high level, you'd be looking at something like this:

  1. Get the current form values. You'd probably have to loop through each input element and add its name and value to an array. If there are no text boxes, checkboxes, etc. on the page, you should only be left with the hidden fields .NET embeds by default (e.g., __VIEWSTATE and __EVENTVALIDATION).
  2. Parse out the values passed to __doPostBack() and overwrite the existing values for __EVENTTARGET and __EVENTARGUMENT.
  3. Craft your POST request. I'm not sure what (if anything) the library you're looking at offers for this, but a popular way to do this from PHP would be through the cURL extension. For an example, see http://davidwalsh.name/execute-http-post-php-curl.
  4. Get the HTML result and parse with the library as usual.
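Steps 1 and 2 can be sketched roughly as follows, here in Python with only the standard library rather than PHP. `FormFieldCollector`, `build_postback_body`, and the sample markup are hypothetical names and data made up for illustration. The sketch also assumes the page has a single form and only hidden or text inputs; checkboxes, selects, and textareas would need extra handling.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from hidden and text <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        input_type = (attr.get("type") or "text").lower()
        # Only hidden and text inputs are handled in this sketch.
        if name and input_type in ("hidden", "text"):
            self.fields[name] = attr.get("value") or ""


def build_postback_body(html, event_target, event_argument=""):
    """Return the URL-encoded POST body that the postback would submit."""
    collector = FormFieldCollector()
    collector.feed(html)
    fields = dict(collector.fields)
    # Step 2: overwrite the two fields that __doPostBack() would set.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Made-up page fragment; real __VIEWSTATE values are much longer.
sample = """
<form method="post" action="items.aspx">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="submit" name="btnSearch" value="Search" />
</form>
"""

body = build_postback_body(sample, "dgItem$_ctl2$_ctl0")
print(body)
```

The submit button is left out on purpose: a script calling form.submit() does not send button values, so the body here should match what the browser would post.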

Alternatively, if you're always making pretty much the same request to the same page, you could probably skip some steps in parsing the form and just jump straight to crafting the POST request.
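As a rough illustration of that shortcut, the sketch below builds the POST request directly from a fixed set of fields, using Python's standard library. The URL, the field values, and the `make_postback_request` helper are all placeholders I made up. It also assumes the server accepts the hidden field values you captured earlier, which may not hold for every ASP.NET page.

```python
from urllib.parse import urlencode
from urllib.request import Request


def make_postback_request(url, fields):
    """Build (but do not send) a form-encoded POST request."""
    data = urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=data, headers=headers, method="POST")


# Placeholder values; a real __VIEWSTATE would be copied from the page.
fields = {
    "__EVENTTARGET": "dgItem$_ctl2$_ctl0",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "dDwtMTIz",
}

req = make_postback_request("http://example.com/items.aspx", fields)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen()` would send it, and the response body is the HTML to hand to your parser.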

That's not going to be a ton of fun, but it would work for this case. If you needed to deal with more complicated cases involving JS, or if you just want to handle this a different way, there are (as you mentioned) libraries that basically drive browsers and handle these things for you. The two that come to mind first are:

There are other options too, but I don't know of any that are going to be quick and easy to integrate into an existing PHP script.

Jonathan S.
  • Just started the PHP script, so I'm not overly concerned if I have to start over, but I do like the jQuery-like syntax of this library. I'm going to look into the 2 libraries you suggested, and if those don't work, I might try hacking the post as you suggested. Thanks! – mpen Jun 26 '12 at 23:57

I ended up using Python with Selenium's Firefox WebDriver. Since I'm driving a real browser, I can do everything Firefox can.
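For anyone curious what that looks like, here is a minimal sketch, not my exact script. `click_postback_link` is a made-up helper; it only relies on the driver's `find_element(by, value)` method and `page_source` property, and it assumes the event target contains no double quotes. Depending on the page, you may need an explicit wait before reading `page_source`.

```python
def click_postback_link(driver, event_target):
    """Click the link whose href calls __doPostBack for event_target and
    return the page HTML afterwards."""
    # The single quotes are part of the match so that a target such as
    # 'dgItem$_ctl2$_ctl0' does not also match 'dgItem$_ctl2$_ctl01'.
    xpath = f"//a[contains(@href, \"'{event_target}'\")]"
    # "xpath" is the locator strategy string that Selenium's By.XPATH stands for.
    link = driver.find_element("xpath", xpath)
    link.click()
    return driver.page_source
```

With Selenium installed, you would create the driver with `webdriver.Firefox()`, load the page with `driver.get(url)`, and then call `click_postback_link(driver, "dgItem$_ctl2$_ctl0")`.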

mpen