I am trying to extract some information from a website. When I navigate to it, it uses JavaScript to connect to a server and then dynamically loads a PHP page. I can follow the sequence in Chrome with the developer tools. I figured it would be easiest to reproduce this in C# with the WebBrowser control by simply navigating to the website. The WebBrowser control should then contain all the JavaScript files, the text of the dynamically loaded PHP page, and so on. Is this true, and if so, where in the control are they stored? I can't seem to find them.
- Can you give a URL so I can try? – Benoit Blanchon Oct 13 '13 at 09:01
- If the page uses AJAX or other dynamic JavaScript, there is no deterministic way to tell when your element is ready. At the very least, you should do the web scraping after `window.onload` has fired for the page. [This sample](http://stackoverflow.com/a/19063643/1768303) may be a good starting point. – noseratio Oct 13 '13 at 11:10
1 Answer
Recreating the whole sequence you followed in Chrome would be a lot of work. However, "extract some information from a website" is something that can be done quite easily.
Disclaimer: I assumed this question was about WPF's WebBrowser control (it would be almost the same for WinForms).
You can get the HTMLDocument once the page is loaded, using:
    using System.Windows;
    using System.Windows.Navigation;
    using mshtml; // <- don't forget to add the reference

    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
            // Subscribe first so the handler is in place before the load starts.
            browser.LoadCompleted += browser_LoadCompleted;
            browser.Navigate("http://google.com/");
        }

        void browser_LoadCompleted(object sender, NavigationEventArgs e)
        {
            HTMLDocument doc = (HTMLDocument)browser.Document;
            // innerHTML is already a string, so no ToString() call is needed.
            string html = doc.documentElement.innerHTML;
            // from here, you should be able to parse the HTML
            // or sniff the HTMLDocument (using HTML Agility Pack for instance)
        }
    }
From this HTMLDocument, you have access to a lot of properties, including the HTML elements, CSS styles and scripts. I invite you to set a breakpoint and check out what best fits your needs.
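As a rough illustration of where the scripts live, here is a minimal sketch that walks the document's `scripts` collection. The `DocumentInspector` class and `DumpScripts` method are names made up for this example. Note the assumption that an external script usually only exposes its URL through `src`; the downloaded file itself is generally not kept in the document, so you would have to fetch it separately.

```csharp
using System.Diagnostics;
using mshtml; // COM reference: Microsoft HTML Object Library

// Hypothetical helper, not part of WPF or mshtml.
static class DocumentInspector
{
    // Lists every <script> element currently present in the document.
    public static void DumpScripts(HTMLDocument doc)
    {
        foreach (IHTMLElement element in doc.scripts)
        {
            var script = element as IHTMLScriptElement;
            if (script == null)
                continue;

            if (!string.IsNullOrEmpty(script.src))
            {
                // External script: only the URL is available here.
                Debug.WriteLine("external: " + script.src);
            }
            else
            {
                // Inline script: the source text is part of the document.
                Debug.WriteLine("inline: " + script.text);
            }
        }
    }
}
```

Calling `DocumentInspector.DumpScripts(doc)` from the `LoadCompleted` handler would print the result to the debugger output window.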
Nevertheless, since the page you want to load uses JavaScript to fill in its content, the HTMLDocument will probably not be complete at the time the LoadCompleted event is raised.
In that case, I suggest using a timer to poll the document until the content is stable.
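A minimal sketch of the polling idea, assuming "stable" means the HTML did not change over a few consecutive ticks. The `StableContentWatcher` class, its `ContentStable` event and the tick threshold are all invented for this example; a real page that updates itself forever (clocks, ads) would also need a timeout, which is left out here.

```csharp
using System;
using System.Windows.Controls;
using System.Windows.Threading;
using mshtml; // COM reference: Microsoft HTML Object Library

// Hypothetical helper: polls a WPF WebBrowser until its HTML stops changing.
sealed class StableContentWatcher
{
    private readonly WebBrowser _browser;
    private readonly DispatcherTimer _timer;
    private readonly int _requiredStableTicks;
    private string _lastHtml;
    private int _stableTicks;

    // Raised once with the final HTML when the content looks stable.
    public event Action<string> ContentStable;

    public StableContentWatcher(WebBrowser browser, TimeSpan interval, int requiredStableTicks)
    {
        _browser = browser;
        _requiredStableTicks = requiredStableTicks;
        // DispatcherTimer ticks on the UI thread, where the document may be accessed.
        _timer = new DispatcherTimer { Interval = interval };
        _timer.Tick += OnTick;
    }

    public void Start()
    {
        _lastHtml = null;
        _stableTicks = 0;
        _timer.Start();
    }

    private void OnTick(object sender, EventArgs e)
    {
        var doc = _browser.Document as HTMLDocument;
        if (doc == null || doc.documentElement == null)
            return; // nothing loaded yet, try again on the next tick

        string html = doc.documentElement.outerHTML;
        if (html == _lastHtml)
        {
            _stableTicks++;
        }
        else
        {
            // Content changed: remember it and restart the count.
            _lastHtml = html;
            _stableTicks = 0;
        }

        if (_stableTicks >= _requiredStableTicks)
        {
            _timer.Stop();
            var handler = ContentStable;
            if (handler != null)
                handler(html);
        }
    }
}
```

One way to use it would be to create the watcher in the `LoadCompleted` handler, for example `new StableContentWatcher(browser, TimeSpan.FromMilliseconds(500), 3)`, subscribe to `ContentStable`, and call `Start()`.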
You could also use the HTMLDocument to inject your own JavaScript code and call C# methods through WebBrowser.ObjectForScripting, but this is going to be much more complicated and harder to maintain.
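For completeness, a rough sketch of the scripting route. `ScriptBridge`, `ReportHtml` and `ScriptInjection` are hypothetical names for this example. It assumes the page has finished loading and that calling `eval` through `InvokeScript` is permitted for the loaded document, which may not hold for every page or emulation mode.

```csharp
using System.Runtime.InteropServices;
using System.Windows.Controls;

// Hypothetical bridge object; the page reaches it as window.external.
// The type must be public and COM-visible for ObjectForScripting to accept it.
[ComVisible(true)]
public class ScriptBridge
{
    public string LastHtml { get; private set; }

    // Called from JavaScript: window.external.ReportHtml(...)
    public void ReportHtml(string html)
    {
        LastHtml = html;
    }
}

static class ScriptInjection
{
    // Attach the bridge, typically before navigating.
    public static ScriptBridge Attach(WebBrowser browser)
    {
        var bridge = new ScriptBridge();
        browser.ObjectForScripting = bridge;
        return bridge;
    }

    // Ask the page to send its current HTML back to C#.
    // Assumption: the document is loaded and exposes a global eval function.
    public static void RequestHtml(WebBrowser browser)
    {
        const string js =
            "window.external.ReportHtml(document.documentElement.outerHTML);";
        browser.InvokeScript("eval", js);
    }
}
```

The same bridge could be called by script you inject to signal that a particular element has appeared, which avoids polling but ties your code to the page's internals.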
