
I'm working with a form that has several fields, some text, and several hidden. The problem is that when I look at the list of fields that my mechanize.Browser object "sees", some important hidden fields are missing, but not all. According to the most popular answer for this similar question, this is happening because the web page is querying the user-agent string. That is not the case for me, and I know this for two reasons:

  1. When I save the "scraped" form to a file, I can see the missing fields, and
  2. I've altered my browser object's user-agent string, as that solution suggests, but it does not help me.

What does help me is the second most popular solution to that issue, but I don't understand why this is. Why would Mechanize "see" some hidden form fields but not others, requiring manual input of the missing fields?
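For anyone wanting to reproduce the check in point 1, here is a minimal sketch of how the saved HTML can be scanned for hidden inputs independently of mechanize, using only the standard library. The sample markup and the names `HiddenInputLister` and `list_hidden_inputs` are made up for illustration. The idea that a field sitting outside its `<form>` tag could be what mechanize skips is only a guess, not something I have confirmed.

```python
from html.parser import HTMLParser


class HiddenInputLister(HTMLParser):
    """Collect every <input type="hidden"> and note whether it is inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.hidden = []  # list of (name, value, inside_form) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input":
            attributes = dict(attrs)
            if (attributes.get("type") or "").lower() == "hidden":
                self.hidden.append(
                    (attributes.get("name"), attributes.get("value"), self.form_depth > 0)
                )

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def list_hidden_inputs(html):
    """Return all hidden inputs found in the given HTML string."""
    parser = HiddenInputLister()
    parser.feed(html)
    parser.close()
    return parser.hidden


# Hypothetical sample page: one hidden field inside the form, one outside it.
SAMPLE = """
<form action="/submit" method="post">
  <input type="text" name="user">
  <input type="hidden" name="token" value="abc">
</form>
<input type="hidden" name="stray" value="xyz">
"""

for name, value, inside in list_hidden_inputs(SAMPLE):
    print(name, value, "inside <form>" if inside else "outside <form>")
```

Comparing this list against the controls mechanize reports for the same page shows exactly which hidden fields go missing.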

BlueBomber
  • Hi! Could you please post the html source on which mechanize loses fields somewhere? I'm really interested in investigating it. – alex_jordan Nov 12 '12 at 20:51
  • No, I'm sorry. I know that it would be helpful, but I cannot share the actual form, *but*, I'll reiterate that it seems to be a rather plain vanilla form. When I open it in a browser and either view or save the HTML, all fields are there. In mechanize, some are not. – BlueBomber Dec 31 '12 at 00:54

1 Answer

Granted, I don't know what you're actually trying to do, but as someone who's been scraping webpages for years I have to give you some unsolicited advice. I apologize in advance.

I would strongly urge you to transition over to something that can handle JavaScript. Mechanize is a great module, and it was amazingly useful back in the day, but the web is now all blinking lights, CSS and dancing babies you have to click.

The reason I say this is that the 'hidden' fields could be something fancy, or they could be forms modified by JavaScript, and you'll waste hours reverse engineering how they work just to hammer the square peg into the round hole.

The modern, but unfortunately titanically heavyweight, replacements for Mechanize that I would suggest are:

  • phantomjs, which provides a WebKit-based, JavaScript-centric way to interact with webpages (headlessly, which is a bonus). It's Qt-based, but it has solid release binaries, and if you build from source it contains everything it needs to run without having to sync up with some specific version of Qt.

  • PySide bindings for QtWebKit, which are nifty, although there can be a bit of a learning curve. IMHO this is my favorite, just because it's nice to be able to reach inside the browser and get my hands dirty to see what's going on.

  • WebKit also provides a nice (although poorly supported by Python) interface where you can enable a websocket server in the browser and drive it over websockets using the API defined in Inspector.json. Stock Chrome supports this out of the box. You can find more details on the Chrome developer website.

    So, this is pretty much WebKit-heavy and has nothing to do with what you're asking about, but in the long run this is where you're going to end up if you want to really automatically navigate and scrape the web.
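To give a feel for the websocket option, here is a rough sketch of the JSON commands such a session would send. It only builds the messages. Actually sending them needs a websocket client and a browser started with remote debugging enabled (for Chrome, the `--remote-debugging-port` flag), neither of which is shown. `Page.navigate` and `Runtime.evaluate` are methods of the remote debugging protocol, while `make_command`, the example URL and the JavaScript expression are just illustrative assumptions.

```python
import itertools
import json

# Each command carries a unique id so replies can be matched to requests.
_command_ids = itertools.count(1)


def make_command(method, params=None):
    """Serialize one remote-debugging command as a JSON text frame."""
    return json.dumps(
        {"id": next(_command_ids), "method": method, "params": params or {}}
    )


# Placeholder URL: substitute the page that holds the form.
navigate = make_command("Page.navigate", {"url": "http://example.com/form"})

# Ask the live page (after its scripts have run) for all hidden inputs.
expression = (
    "JSON.stringify(Array.from("
    "document.querySelectorAll('input[type=hidden]'))"
    ".map(function (e) { return [e.name, e.value]; }))"
)
read_hidden = make_command(
    "Runtime.evaluate", {"expression": expression, "returnByValue": True}
)

print(navigate)
print(read_hidden)
```

Because the expression runs inside the real browser, it sees hidden fields that scripts add or change after the page loads, which a static parser never will.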

synthesizerpatel