
I am trying to parse data from a page that is not filled in until after the page has finished loading. Because of this, I cannot get a simple solution utilizing

// Blocks until the initial document finishes loading, but returns before
// the AJAX-delivered content has arrived.
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
    Application.DoEvents();
}

to work. I have tried using the solution found at View Generated Source (After AJAX/JavaScript) in C#, but I cannot figure out how to make it wait until the post-load data has been downloaded. Please help! The data is automatically filled into the page after it loads; no user interaction is required. Thanks!

I just found Waiting for WebBrowser ajax content, where the answer was to use a timer. I am not sure how to do this using a timer instead of Thread.Sleep() (which blocks the thread completely); can someone help me understand the proper way to use one, with a quick code sample? Thanks again.
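From what I can gather, a minimal sketch of the timer approach might look something like the following (assuming a WinForms form hosting the WebBrowser control; the URL and element id are placeholders), but I am not sure this is the proper way:

using System;
using System.Windows.Forms;

public class ScraperForm : Form
{
    private readonly WebBrowser wb = new WebBrowser();
    private readonly Timer pollTimer = new Timer { Interval = 250 };

    public ScraperForm()
    {
        Controls.Add(wb);
        // Start polling only after the initial document has loaded. The
        // timer fires on the UI thread, so the message loop keeps running
        // between ticks instead of being frozen by Thread.Sleep().
        wb.DocumentCompleted += (s, e) => pollTimer.Start();
        pollTimer.Tick += PollTimer_Tick;
        wb.Navigate("http://example.com/store?id=123"); // placeholder URL
    }

    private void PollTimer_Tick(object sender, EventArgs e)
    {
        // "hours" is a placeholder for the element the AJAX call fills in.
        HtmlElement element = wb.Document.GetElementById("hours");
        if (element != null && element.InnerHtml != null)
        {
            pollTimer.Stop();
            // The element is populated; it should be safe to parse now.
        }
    }
}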

I am looking into the suggestion of calling the AJAX endpoint myself, but I think it would be better to use the timer. I am still looking for help on the subject. Thanks.

  • If that site owner wanted their data to be used by someone else - they would provide convenient API for that – zerkms May 28 '12 at 23:34
  • It is grabbing the hours for a store location...not exactly top secret and not something they would provide an API for either...thanks though. – Brandon May 28 '12 at 23:36
  • 1
    then just perform the same ajax request, without grabbing the whole page – zerkms May 28 '12 at 23:36
  • I will look into how to do that thanks (this is my first encounter with using the webbrowser control, and thus my first bout with JS via that control). Thanks again. – Brandon May 28 '12 at 23:40

2 Answers


Take a look at the page you are dealing with using Firebug for Firefox. Its "Net" tab will let you see the actual raw data of all the subsequent HTTP AJAX requests that occur while the page is loading (after the initial part of the page has loaded).

By looking at this data, it is quite likely you will find JSON or XML that contains exactly what you are looking for, returned in response to a GET request containing an ID or something of that nature.

Using a 'fake' browser as mentioned in that linked post should be considered a last resort: it will yield the worst performance on your end, because you will likely be downloading and parsing far more data than necessary.
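For example, a minimal sketch of requesting such an endpoint directly (the URL, query string, and header here are hypothetical; copy the real request out of Firebug's Net tab):

using System;
using System.Net;

class DirectFetch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Some endpoints check for this header to distinguish AJAX calls.
            client.Headers["X-Requested-With"] = "XMLHttpRequest";
            // Hypothetical endpoint; substitute the request seen in Firebug.
            string json = client.DownloadString("http://example.com/api/storehours?storeId=123");
            Console.WriteLine(json); // parse with the JSON library of your choice
        }
    }
}

This downloads only the data payload, skipping the page, its scripts, and the browser control entirely.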

John Culviner

For my situation the following solved it:

// Wait for the initial page load to complete.
while (wb.ReadyState != WebBrowserReadyState.Complete)
    Application.DoEvents();

// Keep pumping messages until the AJAX call populates the element,
// bailing out if the element no longer exists (e.g. after a redirect).
while (wb.Document.GetElementById(elementId) != null && wb.Document.GetElementById(elementId).InnerHtml == null)
    Application.DoEvents();

The second while loop waits until a specified element is populated by the AJAX. In my situation, if an invalid store # is provided in the URL, the site forwards to a 404-type page. The first condition verifies the element still exists on the page, which it won't if the browser gets sent to the 404 page. The second condition waits until the element is populated.

An interesting thing I found is that after the AJAX populates the page, wb.Document.InnerText and wb.DocumentStream still contain the originally downloaded HTML; only wb.Document.InnerHtml is updated. In my situation I am creating an HtmlAgilityPack HtmlDocument from the results. Because the DocumentStream becomes outdated, I have to recreate my document like this:

htmlDoc.LoadHtml("<html><head><title>" + wb.DocumentTitle + "</title></head><body>" + wb.Document.Body.InnerHtml + "</body></html>");

In my situation I don't care about the meta/script tags in the header, so this works. Anyone who does care about those would obviously need to adapt that line of code for their own use.
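A rough sketch of rebuilding and querying the document this way (the element id and XPath are placeholder examples, not from my actual page):

using System.Windows.Forms;
using HtmlAgilityPack;

static class DomSnapshot
{
    // Rebuild from the live DOM (wb.Document.Body.InnerHtml) rather than
    // the stale wb.DocumentStream.
    public static HtmlAgilityPack.HtmlDocument Rebuild(WebBrowser wb)
    {
        var htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml("<html><head><title>" + wb.DocumentTitle + "</title></head><body>" + wb.Document.Body.InnerHtml + "</body></html>");
        return htmlDoc;
    }
}

// Usage, e.g.:
// var doc = DomSnapshot.Rebuild(wb);
// var hours = doc.DocumentNode.SelectSingleNode("//*[@id='hours']");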

Brandon