I'm trying to scrape a site where I can only rely on classes and element hierarchy to find the right nodes. But `Mechanize::Page#search` returns `Nokogiri::XML::Element`s, which I can't use to fill in and submit forms, etc.

I'd really like to use pure CSS selectors. Matching on classes seems pretty straightforward with the various `_with` methods too, but matching things like `:not(.class)` is quite verbose compared to a plain CSS selector, and I have no idea how to match on element hierarchy with them.

Is there a way to convert Nokogiri elements back to Mechanize objects or even better get them straight from the search method?

raphinesse
  • Do you have an example of a form field you cannot identify with Mechanize? – Mark Thomas Feb 05 '12 at 01:53
  • @mark All the forms and form fields on the relevant page have randomly generated IDs and names. Of course there are no elements you can't identify, since you could always do something like `page.forms[3]`. But retrieving that form with `page.search '.main-content form'` is more meaningful IMHO and probably less prone to break when something on the site changes. – raphinesse Feb 05 '12 at 13:31
  • I believe you can find your answer in [this old answer](http://stackoverflow.com/questions/2469117/nokogiri-error-undefined-method-radiobutton-with-why/6003166#comment11504418_6003166). – Phrogz Feb 05 '12 at 14:21

1 Answer

As stated in this answer, you can simply construct a new `Mechanize::Form` object from the `Nokogiri::XML::Element` retrieved via `Mechanize::Page#search` or `Mechanize::Page#at`:

```ruby
require 'mechanize'

a = Mechanize.new
page = a.get 'https://stackoverflow.com/'

# Get the search form via ID as a Nokogiri::XML::Element
form = page.at '#search'

# Convert it back to a Mechanize::Form object
form = Mechanize::Form.new form, a, page

# Use it!
form.q = 'Foobar'
result = form.submit
```

Note: You have to provide the Mechanize object and the Mechanize::Page object to the constructor to be able to submit the form. Otherwise it would just be a Mechanize::Form object without context.


There seems to be no central utility function that converts `Nokogiri::XML::Element`s to Mechanize elements; instead, the conversions are implemented wherever they are needed. Consequently, a method that searches the document by CSS or XPath and returns Mechanize elements where applicable would need a fairly large `case` statement over the node type. Not exactly what I imagined.

raphinesse
  • Is there a way to do this and get the equivalent of a Page object rather than a form? I tried `Mechanize::Page.new` and it didn't work... the Mechanize syntax is just much easier to work with than Nokogiri's. – Carpela Sep 16 '15 at 10:16
  • @KeiranBetteley Could you please elaborate on this? I don't understand why you would need a new page object? – raphinesse Sep 16 '15 at 10:57
  • What I want to do is take a subset of the webpage, e.g. `page.search("table.results")`, and then use Mechanize methods on it, e.g. `result = page.search("table.results").first`, `result = result.convert_to_mechanize_object`, `links = result.links`. Does that make any more sense? I wonder if you could create a fake page with the original header information but just that particular section of the DOM as the body. – Carpela Sep 16 '15 at 12:00
  • @KeiranBetteley If you need to work with generic subsets of the page, i.e. *nodes*, then I would suggest actually using `Nokogiri::XML::Node` for that. If you want to stick with the Mechanize API, then unfortunately I can't help you. It might be possible, but it will probably involve quite a bit of code reading, monkey patching and relying on undocumented behavior. Good luck ;) – raphinesse Sep 16 '15 at 13:01