
There have been quite a few posts on this issue, but none of them really seems to answer the question I have.

I use TIdHttp to load the source code of this website: http://www.nationalgeographic.com/

I tried to extract some data but realized that the data is generated by a script. There is an inline script in the source code, plus a few links to external js files.

How could I run some or all of the scripts on the page and get the generated source code?

I am using this part in a secondary thread and would like to avoid using a WebBrowser component.

I could extract the scripts or links from the source code returned by TIdHttp and fetch a js file with idhttp.get(*.js), but I presume that would be too simple to work.

Rob Kennedy
David K.
  • You will need to run the script, which is exactly what a browser would do, so a web browser component seems more appropriate in your case... – whosrdaddy Sep 03 '15 at 09:55
  • I have opened the webpage in different browsers (Opera, IE and Firefox). When I look at the source code in each browser, they all show me the script but not the content I am interested in. How could a browser get me any further? – David K. Sep 03 '15 at 10:41
  • For instance, the `TWebBrowser` control is a wrapper around Internet Explorer and as such can run JavaScript. It also lets you manipulate the final document (through MSHTML). – TLama Sep 03 '15 at 11:03
  • I tried to get the source from the WebBrowser using IPersistStreamInit, but it is incomplete, as already outlined above. How do I get the content of interest via MSHTML? – David K. Sep 03 '15 at 11:09
  • That's too broad to answer. If you were specific, you might get a specific solution. – TLama Sep 03 '15 at 11:15
  • I provided the address of the page whose source code I want above. How much deeper would you need to go? – David K. Sep 03 '15 at 11:16
  • That site seems to be dynamic (it seems to load as you scroll), so I'm afraid this will require deeper analysis. Anyway, why do you need the source code of the site? With MSHTML you can extract only certain elements, e.g. links, images, paragraphs etc. Isn't that what you are looking for, rather than the page source code? – TLama Sep 03 '15 at 11:42
  • Yes, I would like to get the headings, the text and the associated pictures to generate a sort of RSS feed for each article. – David K. Sep 03 '15 at 11:54
  • You don't want the source code. You want to explore the DOM. Chrome's developer tools let you do that interactively, which might help you determine what code to write in your program. – Rob Kennedy Sep 03 '15 at 12:32
  • "You don't want the source code. You want to explore the DOM" Wow! And what if I say that I want the source of the page? We are drifting away from the problem: how to get the source code of a webpage. What to do with it later should not be part of this post ;-) – David K. Sep 03 '15 at 12:58
  • Why are you not using the [official way](http://press.nationalgeographic.com/connect/)? – whosrdaddy Sep 03 '15 at 14:27
  • From the documentation, I see that the DOM is used for parsing the website, which is not what I want. It may be possible to use the DOM to parse the whole page and then rebuild the source into a single string. That seems to be a very resource- and time-consuming procedure: parsing and unparsing the page source just to get the script-generated content. Any other solution? – David K. Sep 03 '15 at 14:28
  • I am surprised how many of the proposed solutions answer questions that were not asked. – David K. Sep 03 '15 at 14:45
  • Your problem is that this specific page modifies itself by means of JavaScript, so you need a browser. You can use a [headless browser](http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions) if you want. – whosrdaddy Sep 03 '15 at 14:46
  • You have already loaded the source code and confirmed that the information you want isn't there. Thus, you don't want the source code. You're under the mistaken impression that the page's JavaScript modifies the source code. It doesn't do that, though. It directly modifies the browser's internal representation of the page's structure, which is presented as the DOM. Thus, again, you want access to the DOM. The DOM offers structured access to the page's data. Wouldn't you prefer that over parsing HTML, anyway? Parsing is always a pain. – Rob Kennedy Sep 03 '15 at 15:28

1 Answer


In the end, the answer turned out to be very basic:

    document := webBrowser.Document as IHTMLDocument2;
    result := document.body.innerHTML;

This retrieves the HTML of the page body, including the content generated dynamically at runtime by scripts.
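For completeness, here is a minimal sketch of how those two lines might be wrapped in a function. It rests on a few assumptions that are not in the original post: the helper name `GetRenderedHtml` is hypothetical, the `TWebBrowser` has already been told to navigate to the page, and the function is called from the main VCL thread (the control needs a message loop). It reads `documentElement.outerHTML` rather than `body.innerHTML` so that the `<head>` is included as well. Note that `READYSTATE_COMPLETE` only means the document has finished loading; content that scripts add later (for example on scroll) may still be missing.

```pascal
// Fragment for the implementation section of a VCL unit.
// Unit scope prefixes (System., Vcl.) depend on the Delphi version.
uses
  SysUtils, Forms, SHDocVw, MSHTML;

// Hypothetical helper: serializes the live DOM of the loaded page to HTML.
// Assumes WebBrowser.Navigate(...) was already called on the main thread.
function GetRenderedHtml(WebBrowser: TWebBrowser): string;
var
  Document: IHTMLDocument3;
begin
  Result := '';

  // Pump messages until the browser reports that loading has finished.
  while WebBrowser.ReadyState <> READYSTATE_COMPLETE do
  begin
    Application.ProcessMessages;
    Sleep(10);
  end;

  // documentElement is the <html> node, so outerHTML covers head and body
  // as they exist in the DOM after the scripts have run.
  if Supports(WebBrowser.Document, IHTMLDocument3, Document) and
     Assigned(Document.documentElement) then
    Result := Document.documentElement.outerHTML;
end;
```

A caller could then do something like `Memo1.Text := GetRenderedHtml(WebBrowser1);` after starting navigation, where `Memo1` and `WebBrowser1` are placeholder component names.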

David K.