How to extract the dynamically generated HTML from a website

Question

Is it possible to extract the HTML of a page as it shows in the HTML panel of Firebug or the Chrome DevTools?

I have to crawl a lot of websites, but sometimes the information is not in the static source code: JavaScript runs after the page has loaded and creates new HTML content dynamically. If I then extract the source code, that content is not there.

I have a web crawler built in Java to do this, but it's using a lot of old libraries. Therefore, I want to move to a Rails/Ruby solution for learning purposes. I already played a bit with Nokogiri and Mechanize.

score 1 · Accepted Answer · edited Mar 19 '20 at 04:18

If the crawler is able to execute JavaScript, you can get the dynamically created HTML structure by evaluating document.firstElementChild.outerHTML in the loaded page.
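As a rough illustration of the point above, here is a minimal Ruby sketch that asks a JavaScript-capable driver to evaluate that expression. The helper names `rendered_html` and `rendered_html_with_chrome` and the constant `DOM_SNAPSHOT_JS` are made up for this example. It assumes the selenium-webdriver gem and a local Chrome installation, and it does not wait for late-running scripts, so a real crawler would probably need an explicit wait before taking the snapshot.

```ruby
# Expression from the answer: serializes the live DOM, including nodes
# that JavaScript added after the initial page load.
DOM_SNAPSHOT_JS = "return document.firstElementChild.outerHTML;"

# Hypothetical helper. `driver` is assumed to respond to #get and
# #execute_script, as a Selenium::WebDriver driver does.
def rendered_html(driver, url)
  driver.get(url)                          # load the page in the browser
  driver.execute_script(DOM_SNAPSHOT_JS)   # returns the current DOM as a String
end

# Assumption: the selenium-webdriver gem and Chrome are installed.
# The gem is required lazily so the helper above works with any driver object.
def rendered_html_with_chrome(url)
  require "selenium-webdriver"
  driver = Selenium::WebDriver.for(:chrome)
  begin
    rendered_html(driver, url)
  ensure
    driver.quit                            # always close the browser
  end
end
```

Calling `rendered_html_with_chrome("http://example.com/")` should then return the markup roughly as the DevTools HTML panel shows it, rather than the static source.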

Nokogiri and Mechanize are currently not able to execute JavaScript: Nokogiri only parses markup, and Mechanize fetches pages without running their scripts. See "Ruby Nokogiri Javascript Parsing" and "How do I use Mechanize to process JavaScript?" for details.

You will need another tool, such as Watir or Selenium. These drive a real web browser and can therefore handle any JavaScript the page runs.
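For the Watir route, a sketch along the following lines may be enough to get started. The helper names `snapshot` and `snapshot_with_watir` are hypothetical, and the code assumes the watir gem plus a Chrome installation. Like any such snapshot, it captures the DOM at the moment of the call, so content that appears later would need a wait first.

```ruby
# Hypothetical helper. `browser` is assumed to respond to #goto and #html,
# as a Watir::Browser does.
def snapshot(browser, url)
  browser.goto(url)   # navigate the real browser to the page
  browser.html        # HTML of the page as currently rendered
end

# Assumption: the watir gem and Chrome are installed.
# The gem is required lazily so `snapshot` works with any browser-like object.
def snapshot_with_watir(url)
  require "watir"
  browser = Watir::Browser.new(:chrome)
  begin
    snapshot(browser, url)
  ensure
    browser.close     # always shut the browser down
  end
end
```

The returned string can be handed to Nokogiri (for example `Nokogiri::HTML(html)`), so existing parsing code can stay as it is while the browser takes care of running the scripts.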

score 0 · Answer 2 · edited Mar 19 '20 at 04:21

You can't fetch the records that come from the database side. You can only fetch the static HTML code.

The JavaScript must be requesting the records from the database with a query request, which the crawler can't fetch.

edited Mar 19 '20 at 04:21 by the Tin Man · answered Jul 21 '14 at 12:07 by Jeet

Even the JavaScript that is inside the HTML? The data is inside the HTML, just in the – Mauro M Jul 21 '14 at 12:21
What will this JavaScript do? Is it fetching records from the database? One more question: is your crawler able to fetch JavaScript code? – Jeet Jul 21 '14 at 12:24
Look at this example: http://loja.puket.com.br/pijamas/feminino-adulto/pijama/conjunto-longo-coruja-030600572 It's a Brazilian website, so just look at the dropdown named "Tamanho". As you can see, there are 3 sizes you can choose; if you look at them with the inspector tool, you can actually see them inside – Mauro M Jul 21 '14 at 12:31
The question and JavaScript execution are unrelated to databases. – Sebastian Zartner Jul 23 '14 at 09:49
