
How do I write a Mechanize scraper to get the content from every HTML tag on a web page? Or do I need to convert the page to a string and use regex to get all the content between `<.*?>` and `<\/.*?>`?

Username

2 Answers


To find more information regarding writing a web scraper with Mechanize take a look at the following tutorials:

Also keep in mind that Mechanize uses the Nokogiri gem for its underlying HTML parsing. If you are not attached to Mechanize, consider using Nokogiri directly to parse the HTML tags.

Do not convert the page to a string and use regex to get the HTML content. See this answer for more information on why that is a bad idea.

--Edit--

As @pguardiario mentioned in the comment below, the code to get the text content of every tag is `page.search('*').map(&:text)`. Note that the `*` selector must be quoted.

2016rshah
  • @Зелёный fair enough, I have improved the answer to make it more useful. – 2016rshah Jul 07 '15 at 14:59
  • Thanks. I'm literally trying to go through the content of every tag on a web page. Is there a way to do this with Mechanize/Nokogiri? – Username Jul 07 '15 at 15:02
  • The short answer is yes, there is a way to do that. Do you need to separate the content into a data structure based on which tag it was in, or do you just want the plain text all jumbled together? – 2016rshah Jul 07 '15 at 15:03
  • (Also if my answer helped don't forget to click the green check to accept it) – 2016rshah Jul 07 '15 at 15:04
  • For each tag, I want the content in plain text. I have not yet found a way to do this with Mechanize or Nokogiri. – Username Jul 07 '15 at 15:11
  • That would be `page.search('*').map(&:text)` – pguardiario Jul 08 '15 at 00:27

Are you limited to Mechanize? If not, you could try Watir or plain Selenium to get the web page with all its tags in one object.

Victor Ch.