I am scraping webpages, and when I run my scraper application on a Windows XP box with IE8 (the latest version Windows XP supports), webBrowser.Body.OuterHtml returns different HTML source than it does on Windows 7 with IE9. Does anyone know how to get the raw, unmodified HTML from the WebBrowser control? I know IE modifies the HTML, so I want to know how to get the raw HTML as returned by the web server. It's annoying because I write the scraper on my Windows 7 dev box and then it doesn't work when I host it on a Windows XP box. If you answer, please don't tell me to use WebClient and download the page: I want to easily support browsing pages without having to worry about the other little webpage details that a WebBrowser control takes care of. I am using the WebBrowser control for a reason. Does webBrowser.DocumentText return the raw HTML, or is that still HTML modified by IE?
- Have you looked into the compatibility and quirk modes? http://stackoverflow.com/questions/2055271/webbrowser-control-ie8-compatibility-mode-on-off-switch, http://stackoverflow.com/questions/646742/how-to-programmatically-turn-off-quirks-mode-in-ie8-webbrowser-control – Jeremy Thompson Jul 15 '12 at 07:11
- It seems like your question is "I know how to download a page from the server with WebClient, but I don't feel like doing it. Please tell me a way to use a WebBrowser, which is designed for showing a webpage to the user and is not designed for making raw HTML available to the programmer, to get raw HTML from the server." Why the aversion to WebClient? – Adam Mihalcin Jul 15 '12 at 07:12
- I am using the WebBrowser control to handle cookies and sessions, so that I can fill in input fields and submit them using POSTs, and also to handle paging links etc. more easily. – kyleb Jul 15 '12 at 16:12
1 Answer
2
Fundamentally you have two opposing concerns:
- You want to get the original source, unmodified by anything the browser can do
- You want to let the browser do things, as you apparently find it useful. (You've said you're using WebBrowser "for a reason" but you haven't actually told us what that reason is.)
If you really need to use WebBrowser for some reason, you might want to fetch each page twice: once within the browser (so that it can do whatever you need it to) and once with WebClient (so that you can get the response without any messing).
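The "fetch twice" idea could be sketched roughly as below. This is only an illustration under some assumptions: `RawHtmlFetcher`, `CreateClient` and `DownloadRaw` are hypothetical names, the page is assumed to be UTF-8, and the second request is assumed to be a plain GET that only needs the cookies the browser control exposes.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper (not from the original answer): re-requests a page with
// WebClient so that the HTML is exactly what the server sent.
public static class RawHtmlFetcher
{
    // Builds a WebClient that sends the given Cookie header, if there is one.
    public static WebClient CreateClient(string cookieHeader)
    {
        var client = new WebClient();
        client.Encoding = Encoding.UTF8; // assumption: the page is UTF-8
        if (!string.IsNullOrEmpty(cookieHeader))
        {
            client.Headers[HttpRequestHeader.Cookie] = cookieHeader;
        }
        return client;
    }

    // Downloads the response body for url as a string, untouched by IE.
    public static string DownloadRaw(Uri url, string cookieHeader)
    {
        using (var client = CreateClient(cookieHeader))
        {
            return client.DownloadString(url);
        }
    }
}
```

From a DocumentCompleted handler this might be called as `RawHtmlFetcher.DownloadRaw(webBrowser.Url, webBrowser.Document.Cookie)`. Two caveats: cookies marked HttpOnly are not visible through `Document.Cookie`, and a page that was reached via a POST cannot be reproduced with a simple GET, so this only helps where the second request is equivalent to the first.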
It's also possible that disabling scripting within the browser control would do everything you need it to - but as you haven't given us the reason for using the browser control in the first place, that may not help...

Jon Skeet