9

I need to access the DOM of an HTML document after the JavaScript on the page has executed. The code below connects to the URL and gets the document, but the DOM it returns never includes the changes made by JavaScript.

public class CustomBrowser
{
    public CustomBrowser()
    {
        //
        // TODO: Add constructor logic here
        //
    }

    protected string _url;
    string html = "";
    WebBrowser browser;

    public string GetWebpage(string url)
    {
        _url = url;
        // WebBrowser is an ActiveX control that must be run in a
        // single-threaded apartment so create a thread to create the
        // control and generate the thumbnail
        Thread thread = new Thread(new ThreadStart(GetWebPageWorker));
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        string s = html;
        return s;
    }

    protected void GetWebPageWorker()
    {
        browser = new WebBrowser();
        //  browser.ClientSize = new Size(_width, _height);
        browser.ScrollBarsEnabled = false;
        browser.ScriptErrorsSuppressed = true;
        //browser.DocumentCompleted += browser_DocumentCompleted;
        browser.Navigate(_url);

        // Wait for control to load page
        while (browser.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        Thread.Sleep(5000);


        var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)browser.Document.DomDocument;

        html = documentAsIHtmlDocument3.documentElement.outerHTML; 


        browser.Dispose();
    }


}

The DOM from the Google Chrome developer tools

The DOM I get in my code

I hope that someone can help me with this problem

MrMister
Abubakr A.Hafiz
  • Please don't post code as images. Post code as text. Also, you should be using events to find when the navigation completes, not a `while` loop with `Application.DoEvents()` or `Thread.Sleep()`. – Heretic Monkey Feb 27 '17 at 21:21
  • I added the code as text; the images are there to clarify the difference between the DOM in the browser and what I get. – Abubakr A.Hafiz Feb 27 '17 at 22:17
  • How about using an alternative control? E.g. http://stackoverflow.com/questions/790542/replacing-net-webbrowser-control-with-a-better-browser-like-chrome – user1946932 Feb 28 '17 at 21:52
  • I tested your code with http://idealtackle.com as the url parameter. There is an image on that page that is changed through JavaScript every time the page loads, and after loading it two different times, two different images were loaded without any problem. If you want to see that for yourself, put a breakpoint on browser.Dispose(); then look at html in Quick Watch: in line 121, BACKGROUND-IMAGE: changes every time you load it. So my guess is that the cause is your browser version, the security settings for running JavaScript, or something like that. – Saman Mar 10 '17 at 05:37
  • Could you please give us your URL so I can check with that too? – Saman Mar 10 '17 at 05:39
  • Here is the link http://autoindex-eg.com/test/ – Abubakr A.Hafiz Mar 10 '17 at 19:21

3 Answers

3

If the client-side script is indeed executing in IE7 as you say, the issue might be purely one of timing. Even after the document has finished loading, you cannot know exactly when the JS scripts will run. Waiting 5 seconds before reaching for the documentElement sounds like a good idea in theory; in practice, the element might exist well before that, or the network might be slow enough that merely fetching the jQuery script takes 5 seconds on its own.

I suggest testing for the existence of the element you are looking for (an img tag, as the case may be). Something along the lines of:

while (browser.Document.GetElementsByTagName("img").Count == 0) {
    Application.DoEvents();
}

This way, you wouldn't need the Thread.Sleep line.
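One caveat with the loop above is that it never exits if the element never appears. A minimal sketch of the same idea with a timeout is shown here; the `DomWait` class, its `Until` method, and the 30-second figure in the usage comment are illustrative names and values, not part of the original code. The message pump is passed in as a delegate so the helper itself does not depend on Windows Forms.

```csharp
using System;
using System.Diagnostics;

public static class DomWait
{
    // Repeatedly calls `pump` until `condition` returns true or `timeout` elapses.
    // Returns true if the condition was met, false if the wait timed out.
    public static bool Until(Func<bool> condition, Action pump, TimeSpan timeout)
    {
        Stopwatch clock = Stopwatch.StartNew();
        while (!condition())
        {
            if (clock.Elapsed >= timeout)
                return false;
            pump(); // in a WinForms app this would be Application.DoEvents
        }
        return true;
    }
}

// Hypothetical usage inside GetWebPageWorker, after the ReadyState loop:
//
//   bool found = DomWait.Until(
//       () => browser.Document.GetElementsByTagName("img").Count > 0,
//       Application.DoEvents,
//       TimeSpan.FromSeconds(30));
```

If `Until` returns false, the caller can still read the DOM as it stands and decide whether a partial result is acceptable.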

MrMister
  • The script will be used to download images from any given URL, not a specific one, so I think this will not work in my case. – Abubakr A.Hafiz Mar 13 '17 at 13:10
  • How come? I did not refer to any specific URL in my answer. – MrMister Mar 13 '17 at 14:07
  • What I'm looking for is the entire document DOM after any Ajax or client-side scripts have executed. I'm not looking for a specific element here; I want to download all the images on any given HTML page, including the background images for any tag. I have already done that, except that I cannot download the images loaded by an Ajax request or by a client-side script. – Abubakr A.Hafiz Mar 13 '17 at 18:31
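
For the general case raised in the comments above, where no specific element is known in advance, one possible heuristic is to sample the serialized DOM until it stops changing. The sketch below is an assumption-laden illustration, not a guaranteed solution: `DomSettle`, `WaitForStable`, and all parameter names are hypothetical, and a quiet period does not prove that no later Ajax request will modify the page.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class DomSettle
{
    // Samples `snapshot` once per `interval` and returns the first value that is
    // seen unchanged `requiredStableSamples` times in a row. If `timeout` elapses
    // first, the most recent sample is returned instead.
    public static string WaitForStable(Func<string> snapshot, Action pump,
        int requiredStableSamples, TimeSpan interval, TimeSpan timeout)
    {
        Stopwatch clock = Stopwatch.StartNew();
        string last = snapshot();
        int stable = 0;

        while (stable < requiredStableSamples && clock.Elapsed < timeout)
        {
            // Keep pumping messages between samples so page scripts and timers can run.
            Stopwatch wait = Stopwatch.StartNew();
            while (wait.Elapsed < interval)
            {
                pump(); // in a WinForms app this would be Application.DoEvents
                Thread.Sleep(10);
            }

            string current = snapshot();
            if (current == last)
            {
                stable++;
            }
            else
            {
                stable = 0;
                last = current;
            }
        }
        return last;
    }
}

// Hypothetical usage in place of Thread.Sleep(5000):
//
//   html = DomSettle.WaitForStable(
//       () => browser.Document.Body.OuterHtml,
//       Application.DoEvents,
//       3, TimeSpan.FromMilliseconds(500), TimeSpan.FromSeconds(30));
```

The sample count, interval, and timeout trade speed against the risk of reading the DOM too early, so they would need tuning per site.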
2

I cannot see the JS being executed here, but I imagine you could find exactly which element is being updated and attach an event handler to its onpropertychange event, like the solution given here: C# WebBrowser control -- Get Document Elements After AJAX?

If the JS is flipping an element by class instead of by id, then you could borrow logic from here: How to select a class by GetElementByClass and click on it programmically

Community
Travis Acton
1

Check how the page renders in IE7. I guess the tag you are missing is added with jQuery, and jQuery 2.2.4, the version on the page, does not support IE7. I think the WebBrowser class renders pages as IE7 by default, even if you have a newer version of IE on your PC.

If you own the page, try adding the jQuery migrate plugin.
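
If you do not own the page, another option is to ask the control to emulate a newer IE version through the FEATURE_BROWSER_EMULATION registry key. The following is a minimal sketch, assuming the .NET Framework on Windows; the `BrowserEmulation` class and its method names are illustrative, and the value is written per user for the executable that hosts the control.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using Microsoft.Win32;

public static class BrowserEmulation
{
    private const string FeatureKey =
        @"Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION";

    // Maps an IE major version (7 to 11) to its FEATURE_BROWSER_EMULATION value:
    // 7000, 8000, 9000, 10000, or 11000.
    public static int ValueForVersion(int ieMajorVersion)
    {
        if (ieMajorVersion < 7 || ieMajorVersion > 11)
            throw new ArgumentOutOfRangeException("ieMajorVersion");
        return ieMajorVersion * 1000;
    }

    // Call once at startup, before the first WebBrowser instance is created.
    public static void Register(int ieMajorVersion)
    {
        // The registry value name must match the file name of the hosting executable.
        string exeName = Path.GetFileName(Process.GetCurrentProcess().MainModule.FileName);
        using (RegistryKey key = Registry.CurrentUser.CreateSubKey(FeatureKey))
        {
            key.SetValue(exeName, ValueForVersion(ieMajorVersion), RegistryValueKind.DWord);
        }
    }
}
```

Calling `BrowserEmulation.Register(11)` before `new WebBrowser()` would request IE11 rendering, provided IE11 is installed on the machine. Note that the hosting executable can differ from what you expect, for example under a debugger host process or a web server worker process.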

  • Not sure if the following would help?: https://www.cyotek.com/blog/configuring-the-emulation-mode-of-an-internet-explorer-webbrowser-control , https://blogs.msdn.microsoft.com/patricka/2015/01/12/controlling-webbrowser-control-compatibility/ , https://weblog.west-wind.com/posts/2011/may/21/web-browser-control-specifying-the-ie-version and http://stackoverflow.com/questions/17922308/use-latest-version-of-internet-explorer-in-the-webbrowser-control – user1946932 Feb 28 '17 at 00:39
  • The page is rendered correctly in IE7, and I changed jQuery to 1.7.1, but nothing changed. – Abubakr A.Hafiz Feb 28 '17 at 13:17
  • I did notice the div class names in the black screenshot above are not in double quotes and the images2.jpg URL isn't either if that means anything. I read that XHTML requires quotes. – user1946932 Feb 28 '17 at 20:53