C# WebBrowser control cannot get generated HTML source

Question

I am trying to get data that is generated by scripts. I am using the WebBrowser control, following the instructions from: C# webbrowser Ajax call

My first attempt is:

webBrowser1.Navigate("https://mobile.bet365.com/#type=Coupon;key=1-1-13-33977144-2-8-0-0-1-0-0-4100-0-0-1-0-0-0-0-0-0-0-0;ip=0;lng=1;anim=1");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
{
    System.Threading.Thread.Sleep(10);
    Application.DoEvents();
}
File.WriteAllText(@"C:\pagesource.txt", webBrowser1.DocumentText);

The page source I get is not what the browser shows. When I modify the code as below:

webBrowser1.Navigate("https://mobile.bet365.com/#type=Coupon;key=1-1-13-33977144-2-8-0-0-1-0-0-4100-0-0-1-0-0-0-0-0-0-0-0;ip=0;lng=1;anim=1");
while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
{
    System.Threading.Thread.Sleep(10);
    Application.DoEvents();
}
MessageBox.Show("Loading completed");
File.WriteAllText(@"C:\pagesource.txt", webBrowser1.DocumentText);

and of course I have to press OK when the dialog is shown. The page source is correct now.

I don't understand why it behaves like that. I just want to get the page source automatically (without any clicks or user actions).

Is the webbrowser control required or do you simply want to get the document source? — dsdel, Mar 18 '18 at 20:04
The WebBrowser control is not required. I tried to work around the problem and ran into this behavior, but I don't know how to resolve it — jimmy, Mar 19 '18 at 04:38
Is it possible that your url contains sensitive data? (the key attribute) Just noted when testing... — dsdel, Mar 19 '18 at 08:01
Not really, it's public. It shows the schedule of football matches on the bet365 website. (The key data contains IDs.) — jimmy, Mar 19 '18 at 08:08

score 0 · Answer 1 · answered Mar 19 '18 at 07:06

Since the WebBrowser control is not required, I would try switching to a different method of obtaining the page source (which also avoids the overhead of the WebBrowser control).

Please note that parsing HTML source is fragile: as soon as the page layout changes or additional JavaScript kicks in, you can run into problems. For retrieving data from web pages, you should look for something like an RSS feed, which is easier to parse than the HTML page source.

However, I could not test the following code against your URL because the site is currently undergoing maintenance. I tested it against my own page and it worked there. Naturally, my own page does not use as much JavaScript as your URL does.

Below are three different methods of obtaining the page source:

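        // placeholder: replace with the address of the page to download
        string url = "https://example.com/";
        string pageSource1 = null, pageSource2 = null, pageSource3 = null;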
        try
        {
            using (System.Net.WebClient webClient = new System.Net.WebClient())
            {
                // perhaps fake the user agent? (the HTTP header name is "User-Agent", not "USER_AGENT")
                webClient.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36 OPR/51.0.2830.55");

                //
                // option 1: using webclient download string (simple call)
                pageSource1 = webClient.DownloadString(url);

                //
                // option 2: getting a stream... (if you prefer using a stream, eg. not reading the whole page until the end)
                var webClientStream = webClient.OpenRead(url);
                if (webClientStream != null)
                {
                    using (System.IO.StreamReader streamReader = new System.IO.StreamReader(webClientStream))
                    {
                        pageSource2 = streamReader.ReadToEnd();
                    }
                }
            }

            //
            // option3: using webrequest (with webrequest/webresponse you can rebuild the browser behavior eg. walking pages)
            System.Net.WebRequest webRequest = System.Net.WebRequest.Create(url);
            webRequest.Method = "GET";

            // dispose the response and its stream once the page has been read
            using (var webResponse = webRequest.GetResponse())
            using (var webResponseStream = webResponse.GetResponseStream())
            {
                if (webResponseStream != null)
                {
                    using (System.IO.StreamReader streamReader = new System.IO.StreamReader(webResponseStream))
                    {
                        pageSource3 = streamReader.ReadToEnd();
                    }
                }
            }
        }
        catch (System.Net.WebException exc)// for web
        {
            Console.WriteLine($"Unable to download page source: {exc.Message}");
            // todo - safely handle...
        }
        catch (System.IO.IOException exc)//for stream
        {
            Console.WriteLine($"Unable to download page source: {exc.Message}");
            // todo - safely handle...
        }

Hope this helps!

Thank you, but it does not work. This page has content generated by JavaScript, so I think the only way to get the page source (if possible) is to use the WebBrowser control or Selenium. I am focusing on the WebBrowser control. — jimmy, Mar 19 '18 at 07:25
With firefox developer network plugin I found the following url: https://mobile.bet365.com/V6/sport/coupon/coupon.aspx?zone=0&isocode=DE&tzi=4&key=[[REPLACEWITHYOURKEY]]&ip=0&gn=0&cid=1&lng=1&ctg=1&ct=75&cts=210&clt=9996&ot=2 - this way I was able to download with webbrowser. Check ' — dsdel, Mar 19 '18 at 08:08
Thank you. This is the data I wanted to retrieve. But my question is not only about the final result, but also about the way the WebBrowser control works: why is loading only fully completed after I do something seemingly pointless (showing a dialog and closing it)? — jimmy, Mar 19 '18 at 17:39
I think the problem lies in the design of the WebBrowser control. I assume that your WebBrowser is running in the main STA GUI thread. Please try running it in another STA thread until you receive the DocumentCompleted event. There is an excellent explanation in the following [SO post](https://stackoverflow.com/questions/4269800/webbrowser-control-in-a-new-thread) — dsdel, Mar 19 '18 at 21:14
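
The last suggestion could be sketched roughly as follows. This is an untested illustration, not a confirmed fix for this site. The class RenderedHtml, the method Fetch and the settleDelay parameter are hypothetical names introduced here. The sketch rests on two assumptions: that a Complete ready state only covers the initial document and not content that scripts add afterwards, and that the MessageBox only helped because its modal message loop gave those scripts time to run. It therefore runs the control on its own STA thread with a real message loop, waits a short quiet period after the last DocumentCompleted event, and then reads the live DOM rather than DocumentText.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of the original question or answer.
static class RenderedHtml
{
    // Loads url in a WebBrowser on a dedicated STA thread and returns the
    // live DOM as HTML once no DocumentCompleted event has fired for
    // settleDelay. The delay is a heuristic (an assumption): the control
    // exposes no event meaning "all scripts have finished".
    public static string Fetch(string url, TimeSpan settleDelay)
    {
        string html = null;

        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            using (var timer = new System.Windows.Forms.Timer())
            {
                timer.Interval = Math.Max(1, (int)settleDelay.TotalMilliseconds);

                timer.Tick += (sender, e) =>
                {
                    timer.Stop();
                    // Read the current DOM instead of DocumentText.
                    var body = browser.Document?.Body;
                    html = body?.Parent?.OuterHtml ?? body?.OuterHtml;
                    Application.ExitThread(); // ends the Application.Run() below
                };

                // DocumentCompleted can fire several times (frames, redirects),
                // so restart the quiet-period timer on every occurrence.
                browser.DocumentCompleted += (sender, e) =>
                {
                    timer.Stop();
                    timer.Start();
                };

                browser.Navigate(url);
                Application.Run(); // message loop for this thread, no Sleep/DoEvents
            }
        });

        thread.SetApartmentState(ApartmentState.STA); // the control requires STA
        thread.Start();
        thread.Join();
        return html;
    }
}
```

A call such as File.WriteAllText(@"C:\pagesource.txt", RenderedHtml.Fetch(url, TimeSpan.FromSeconds(2))) would then replace the polling loop from the question. Note that the sketch waits indefinitely if DocumentCompleted never fires, so real code would probably want an overall timeout as well.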
