1

Is it possible to extract the HTML of a page as it shows in the HTML panel of Firebug or the Chrome DevTools?

I have to crawl a lot of websites, but sometimes the information is not in the static source code: JavaScript runs after the page has loaded and creates new HTML content dynamically. If I then extract the source code, that content is missing.

I have a web crawler built in Java to do this, but it uses a lot of old libraries, so I want to move to a Ruby/Rails solution for learning purposes. I have already played a bit with Nokogiri and Mechanize.

the Tin Man
Mauro M

2 Answers

1

If the crawler is able to execute JavaScript, you can simply get the dynamically created HTML structure using document.firstElementChild.outerHTML.

Nokogiri and Mechanize cannot currently execute JavaScript. See "Ruby Nokogiri Javascript Parsing" and "How do I use Mechanize to process JavaScript?" for details.

You will need another tool, such as Watir or Selenium. These drive a real web browser and can therefore handle any JavaScript.
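As a rough illustration of the approach described above, here is a minimal Ruby sketch that uses Selenium to load a page in a real browser and read back the DOM after scripts have run. It assumes the selenium-webdriver gem and a local Chrome install; the helper names `rendered_html` and `fetch_rendered_html` are made up for this example, and it uses `document.documentElement.outerHTML`, which refers to the same root element as the `document.firstElementChild` expression mentioned earlier.

```ruby
# JavaScript that serializes the live DOM, including nodes added by scripts.
RENDERED_HTML_JS = 'return document.documentElement.outerHTML;'

# Returns the HTML as the browser currently sees it.
# `driver` is anything that responds to #execute_script,
# for example a Selenium::WebDriver driver.
def rendered_html(driver)
  driver.execute_script(RENDERED_HTML_JS)
end

# Hypothetical convenience wrapper: opens `url` in headless Chrome and
# returns the rendered HTML. Assumes selenium-webdriver and Chrome exist.
def fetch_rendered_html(url)
  require 'selenium-webdriver' # loaded lazily, only when a browser is needed

  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  driver = Selenium::WebDriver.for(:chrome, options: options)

  begin
    driver.get(url)
    rendered_html(driver)
  ensure
    driver.quit # always close the browser, even if navigation fails
  end
end
```

Calling `fetch_rendered_html` with a page URL should return markup close to what the DevTools Elements panel shows. Content that scripts add some time after the load event may need an explicit wait before the DOM is read, and the resulting string can then be handed to Nokogiri for parsing.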

the Tin Man
Sebastian Zartner
0

You can't fetch records that come from the database side. You can only fetch the static HTML code.

The JavaScript must be requesting the records from the database through a query, which the crawler can't fetch.

the Tin Man
Jeet
  • Even the JavaScript that is inside the HTML? The data is inside the HTML, just in the – Mauro M Jul 21 '14 at 12:21
  • What does this JavaScript do? Is it fetching records from a database? One more question: is your crawler able to fetch the JavaScript code? – Jeet Jul 21 '14 at 12:24
  • Look at this example: http://loja.puket.com.br/pijamas/feminino-adulto/pijama/conjunto-longo-coruja-030600572 It's a Brazilian website, so just look at the dropdown named "Tamanho". As you can see, there are 3 sizes you can choose; if you look at them with the inspector tool you can actually see them inside – Mauro M Jul 21 '14 at 12:31
  • The question and JavaScript execution are unrelated to databases. – Sebastian Zartner Jul 23 '14 at 09:49