I'm using Delphi's TWebBrowser component to load web pages that I want to parse, and they use JavaScript (AJAX?) to render the user-visible HTML. The well-documented methods of extracting the HTML from such pages return a bunch of JavaScript rather than what the user sees. There are answers to similar questions here going back to 2004, and the approaches they describe all return JavaScript rather than the user-visible HTML. I've seen a couple that suggest alternate ways to access the data, but I have not been able to get any of them to work, nor am I sure how to adapt the code.

My question is: when a web page loads into a TWebBrowser and is perfectly readable once rendered, how can I extract the HTML that the component ultimately renders, rather than the JS code that generates it?

In my case, I'm trying to load a Google search results page, but I've heard this is also an issue on lots of news sites like the Wall Street Journal, the Washington Post, and the New York Times.

var
  url: string;
  d: OleVariant;
begin
  // enter something like "dentist in baltimore" in a Google search,
  // then copy the contents of the ADDRESS field that it generates and
  // paste it here:
  url := '... paste URL Google generates here ...';
  WebBrowser1.Navigate2( url, 0 {nav_flags} );
  // I actually do this in an OnNavigateComplete2 handler, but I'm guessing
  // reading the document right here works just as well
  d := WebBrowser1.Document;
  memo1.Lines.Text := d.documentElement.outerHTML;
end;

The problem is, the memo contains ... and it's just a bunch of JavaScript in the HEAD. Nothing in it resembles what this search actually displays to the user, either in the TWebBrowser or in a regular browser window.

David
  • See: https://stackoverflow.com/questions/22517802/delphi-twebbrowser-get-html-source-after-ajax-load – Brian Aug 13 '19 at 11:59
  • Also, don't scrape Google search pages: https://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results – Brian Aug 13 '19 at 12:05
  • The stuff on that page (the first comment) does not work today. I've already tried both suggestions and all I get is a bunch of JavaScript code. The same goes for the suggestions below it to grab outerHTML. That was from 2014. It may have worked even up until last fall in 2018, but it's not working today in mid-2019. – David Aug 13 '19 at 14:09
  • @TomBrunberg I request that in this case you actually try to reproduce what I'm explaining and take a look at the result. Open a browser window to google.com and enter something like "dentist in baltimore". Then copy the URL from the address bar and use that as your URL in WebBrowser1.Navigate2( the_url, 0 ); Extract outerHTML and tell me how that relates in any way to what is visible in the TWebBrowser viewport for this query. I don't see anything there that the user sees. Then maybe you'll understand the issue I'm pointing at. – David Aug 14 '19 at 18:26

1 Answer

Someone in another forum suggested it's a timing issue, and to replace the OnNavigateComplete2 handler that I'm using with OnDocumentComplete. I had never seen or heard of OnDocumentComplete, nor have I seen it used in any examples, certainly not in any that were simplified to show everything inline so that no timing issues could occur.

But it turns out that this was the crux of the problem in this case, not outerHTML: you need to extract the HTML in an event that fires after the document has finished loading and its JavaScript has had a chance to run. I believed that OnNavigateComplete2 did that, but it fires too early. My bad.
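For anyone who lands here with the same problem, here is a minimal sketch of that change. It assumes a VCL form TForm1 with WebBrowser1 (TWebBrowser) and Memo1 (TMemo), SHDocVw in the uses clause, and the method below assigned to WebBrowser1.OnDocumentComplete; the form and handler names are placeholders, not taken from my real project.

```delphi
// Assumed setup (illustrative): TForm1 holds WebBrowser1: TWebBrowser and
// Memo1: TMemo, and this method is assigned to WebBrowser1.OnDocumentComplete.
// Let the IDE generate the handler: older Delphi versions declare URL as
// "var URL: OleVariant" instead of "const".
procedure TForm1.WebBrowser1DocumentComplete(ASender: TObject;
  const pDisp: IDispatch; const URL: OleVariant);
var
  CurrentBrowser, TopBrowser: IWebBrowser;
  Doc: OleVariant;
begin
  // DocumentComplete fires once per frame as well as for the page itself,
  // so skip every firing that does not come from the top-level browser.
  CurrentBrowser := pDisp as IWebBrowser;
  TopBrowser := (ASender as TWebBrowser).DefaultInterface;
  if CurrentBrowser <> TopBrowser then
    Exit;

  if not Assigned(WebBrowser1.Document) then
    Exit;

  // Same late-bound access as in the question, just read at a later moment.
  Doc := WebBrowser1.Document;
  Memo1.Lines.Text := Doc.documentElement.outerHTML;
end;
```

The Navigate2 call stays where it was; the only change is that nothing reads WebBrowser1.Document right after it. Pages that keep fetching content via AJAX after this event may still need a later read, so treat this as a starting point rather than a guarantee.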

David
  • Good for you that you got it sorted out. This is a typical example of why it is so important to provide a complete minimal reproducible example. – Tom Brunberg Aug 15 '19 at 07:24
  • I hear you, but I would not have included the OnNavigateComplete2 handler in an example. Luckily someone else hit on it based on a comment I made. I've always used that handler to extract and process data from a website or search query. I guess we're looking at a new level of complexity in how web pages are produced that didn't exist previously and that causes unexpected timing and processing issues. Good to know. Thanks for your support. – David Aug 15 '19 at 08:12