12

I tried to get the HTML source in the following way:

webBrowser1.Document.Body.OuterHtml;

but it does not work. For example, if the original HTML source is:

<html>
<body>
    <div>
        <ul>
            <li>
                <h3>
                    Manufacturer</h3>
            </li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_3808675_100021_10194772_">Sony </a>(44)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_108496_100021_10194772_">Nikon </a>(19)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_3808726_100021_10194772_">Panasonic </a>(37)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_3808769_100021_10194772_">Canon </a>(29)</li>
            <li><a href="/4566-6501_7-0.html?filter=1000036_2913388_100021_10194772_">Olympus </a>(21)</li>
            <li class="seeAll"><a href="/4566-6501_7-0.html?sa=1000036&filter=100021_10194772_" class="readMore">See all manufacturers </a></li>
        </ul>
    </div>
</body>
</html>

then the output of webBrowser1.Document.Body.OuterHtml is:

<body>
    <div>
        <ul>
            <li>
                <h3>
                    Manufacturer</h3>
                <li><a href="/4566-6501_7-0.html?filter=1000036_3808675_100021_10194772_">Sony </a>(44)
                    <li><a href="/4566-6501_7-0.html?filter=1000036_108496_100021_10194772_">Nikon </a>(19)
                        <li><a href="/4566-6501_7-0.html?filter=1000036_3808726_100021_10194772_">Panasonic
                        </a>(37)
                            <li><a href="/4566-6501_7-0.html?filter=1000036_3808769_100021_10194772_">Canon </a>
                                (29)
                                <li><a href="/4566-6501_7-0.html?filter=1000036_2913388_100021_10194772_">Olympus </a>
                                    (21)
                                    <li class="seeAll"><a class="readMore" href="/4566-6501_7-0.html?sa=1000036&amp;filter=100021_10194772_">
                                        See all manufacturers </a></li>
        </ul>
    </div>
</body>

As you can see, many </li> end tags are lost.

Is there a way to get the HTML source from the WebBrowser control correctly? Note that in my application I use the WebBrowser to add coordinate info to every node, and I need to output the HTML source with that coordinate info included as attributes of the nodes.

Can anybody help?

Rockycqu

5 Answers

10

Try using the DocumentText or DocumentStream properties.
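As a rough illustration of reading the source through these two properties, here is a minimal sketch. The class and method names (`LoadedSource`, `FromDocumentText`, `FromDocumentStream`) are hypothetical, and the caller is assumed to know the page's encoding. Both properties return the markup as it was loaded, so changes made to the DOM afterwards are not expected to show up.

```csharp
using System.IO;
using System.Text;
using System.Windows.Forms;

static class LoadedSource
{
    // Returns the markup as the control loaded it, already decoded to a string.
    public static string FromDocumentText(WebBrowser browser)
    {
        return browser.DocumentText;
    }

    // Reads the same markup from the raw stream, decoding it with the
    // encoding supplied by the caller (an assumption of this sketch).
    public static string FromDocumentStream(WebBrowser browser, Encoding encoding)
    {
        Stream stream = browser.DocumentStream;
        if (stream == null)
        {
            return null; // no document has been loaded yet
        }
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, `LoadedSource.FromDocumentStream(webBrowser1, Encoding.UTF8)` could be called from the form's DocumentCompleted handler.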

VinayC
  • 1
    Yes, both DocumentText and DocumentStream return the correct HTML source. But when I add attributes to nodes in the DOM tree (with myIHTMLElement.setAttribute()), the HTML source returned by WebBrowser1.DocumentText does not contain any of the added attributes. – Rockycqu Mar 02 '11 at 09:38
  • @Rockycqu, what about the `InnerHtml` property? Does that return correct HTML? – VinayC Mar 02 '11 at 10:15
5

Thank you all. My final solution is as follows. First, use Body.OuterHtml to get the HTML source. Because Body.OuterHtml may omit the end tags for <li> and <td>, the second step is to run Tidy to repair the HTML source. After these two steps, we get the HTML source without errors.
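The two steps could be sketched roughly as below. This is only one possible arrangement, not necessarily what was actually used: it shells out to the HTML Tidy command-line tool, which is assumed to be installed and reachable as `tidy` on PATH, and the names `HtmlRepair` and `Repair` are hypothetical. The answer does not say which Tidy binding or options were used.

```csharp
using System.Diagnostics;
using System.Text;

static class HtmlRepair
{
    // Pipes an HTML string through the "tidy" executable and returns
    // whatever Tidy writes to standard output.
    // Assumption: "tidy" is installed and on PATH.
    public static string Repair(string html)
    {
        var startInfo = new ProcessStartInfo("tidy", "-q -utf8 --show-warnings no")
        {
            UseShellExecute = false,          // required for stream redirection
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            StandardOutputEncoding = Encoding.UTF8
        };

        using (Process process = Process.Start(startInfo))
        {
            // Write UTF-8 bytes directly so the input matches the -utf8 option.
            byte[] input = Encoding.UTF8.GetBytes(html);
            process.StandardInput.BaseStream.Write(input, 0, input.Length);
            process.StandardInput.Close();

            string repaired = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return repaired;
        }
    }
}
```

A call might look like `string html = HtmlRepair.Repair(webBrowser1.Document.Body.OuterHtml);`. Tidy normally emits a complete document, so the result may contain more than the original body fragment.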

Rockycqu
2

Have you tried WebBrowser1.DocumentText?

V4Vendetta
  • Yes, WebBrowser1.DocumentText returns the correct HTML source. But when I add attributes to nodes in the DOM tree (with myIHTMLElement.setAttribute()), the HTML source returned by WebBrowser1.DocumentText does not contain any of the added attributes. – Rockycqu Mar 02 '11 at 09:38
1

If you want to grab the entire HTML source of the WebBrowser control then use this - WebBrowser1.Document.GetElementsByTagName("HTML").Item(0).OuterHtml. This of course assumes you have properly formatted HTML and the HTML tag exists. If you want to narrow it down to just the body then obviously change the HTML tag to the BODY tag. This way you grab any and all changes after "DocumentText" has been set. Sorry, I'm a VB guy, convert as needed ;)
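For C# readers, the expression above might translate to something like the following sketch. The names `WebBrowserHtml` and `GetLiveHtml` are hypothetical; the null and count checks are additions for the case where no document or no HTML element is available.

```csharp
using System.Windows.Forms;

static class WebBrowserHtml
{
    // Returns the OuterHtml of the first <html> element in the live DOM,
    // or null if no document is loaded or no such element exists.
    public static string GetLiveHtml(WebBrowser browser)
    {
        HtmlDocument document = browser.Document;
        if (document == null)
        {
            return null;
        }

        HtmlElementCollection roots = document.GetElementsByTagName("HTML");
        return roots.Count > 0 ? roots[0].OuterHtml : null;
    }
}
```

Usage would be `string html = WebBrowserHtml.GetLiveHtml(webBrowser1);`, called after the page has finished loading and after any attributes have been set.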

Justin Emlay
-2

Have a look at this. WebBrowser on MSDN

Alternatively, you could use WebClient.DownloadString from System.Net (there is also WebClient.DownloadStringAsync). Here is the description: WebClient on MSDN

  • 1
    In my application, I need to use the WebBrowser to add coordinate info to every node and output the HTML source with that coordinate info included as attributes of the nodes. WebClient cannot perform this task. – Rockycqu Mar 02 '11 at 09:40