
How can I scrape data that is dynamically generated by JavaScript in an HTML document using C#?

Using WebRequest and HttpWebResponse from the .NET library, I'm able to get the whole HTML source as a string. The difficulty is that the data I want isn't contained in the source; it is generated dynamically by JavaScript.

If the data I want were already in the source, I could extract it easily with regular expressions.

I have downloaded HtmlAgilityPack, but I don't know if it would take care of the case where items are generated dynamically by JavaScript...

Thank you very much!

bjb568
user3213711
  • You'll have to run it through a JavaScript engine of some sort. Maybe something like [Awesomium](http://www.awesomium.com/)? – Mike Christensen Jun 09 '14 at 23:33
  • Take a look here: http://stackoverflow.com/questions/18539491/headless-browser-and-scraping-solutions – sagibb Jun 09 '14 at 23:38

2 Answers

12

When you make the WebRequest, you're asking the server to give you the page file. That file's content hasn't yet been parsed or executed by a web browser, so the JavaScript in it hasn't done anything yet.
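To make that concrete, here is a minimal sketch using a made-up page (the `RawHtml` string, the `price` id, and the `RawSourceDemo` class are all hypothetical, not taken from any real site). The script that would fill in the value is present in the source, but since nothing has executed it, a regular expression over the raw text only finds an empty element.

```csharp
using System;
using System.Text.RegularExpressions;

static class RawSourceDemo
{
    // Hypothetical page source, as a plain HTTP request would return it.
    // The script is only text at this point; no browser has run it.
    public const string RawHtml =
        "<html><body><div id=\"price\"></div>" +
        "<script>document.getElementById('price').textContent = '42';</script>" +
        "</body></html>";

    // Returns the text inside the div as it appears in the raw source,
    // or null if the div is not found at all.
    public static string ExtractPrice(string html)
    {
        Match match = Regex.Match(html, "<div id=\"price\">(.*?)</div>");
        return match.Success ? match.Groups[1].Value : null;
    }

    static void Main()
    {
        // The div exists in the source, but it is still empty.
        Console.WriteLine("Extracted: '" + ExtractPrice(RawHtml) + "'");
    }
}
```

The value `42` would only show up inside the div after a JavaScript engine has run the script.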

You need a tool that executes the JavaScript on the page if you want to see what the page looks like after a browser has processed it. One option is the built-in .NET WebBrowser control: http://msdn.microsoft.com/en-au/library/aa752040(v=vs.85).aspx

The WebBrowser control can navigate to and load the page, and then you can query its DOM, which will have been altered by the JavaScript on the page.

EDIT (example):

Uri uri = new Uri("http://www.somewebsite.com/somepage.htm");

webBrowserControl.AllowNavigation = true;
// Optional, but I use this because it stops JavaScript errors on the page from breaking your scraper.
webBrowserControl.ScriptErrorsSuppressed = true;
// Start scraping only after the document has finished loading, so do the work in the handler attached here.
webBrowserControl.DocumentCompleted += webBrowserControl_DocumentCompleted;
webBrowserControl.Navigate(uri);

private void webBrowserControl_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    HtmlElementCollection divs = webBrowserControl.Document.GetElementsByTagName("div");

    foreach (HtmlElement div in divs)
    {
        //do something
    }
}
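
If there is no form to host the control (for example in a console application), one commonly used approach is to create the control on its own STA thread and run a message loop there. The following is only a sketch under assumptions: it targets Windows with a reference to System.Windows.Forms, the `HeadlessWebBrowserSketch` class name is made up, and the URL is the same placeholder as above. Content that scripts add after DocumentCompleted fires would still need extra waiting.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class HeadlessWebBrowserSketch
{
    static void Main()
    {
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
            {
                browser.DocumentCompleted += (sender, e) =>
                {
                    // DocumentCompleted can also fire for frames; skip those.
                    if (e.Url != browser.Url) return;

                    // Query the DOM here; printing the body text is just a placeholder.
                    Console.WriteLine(browser.Document?.Body?.InnerText);

                    // Stop the message loop so the thread can finish.
                    Application.ExitThread();
                };

                browser.Navigate("http://www.somewebsite.com/somepage.htm");

                // The control needs a message loop to load and run the page.
                Application.Run();
            }
        });

        // The WebBrowser control requires a single-threaded apartment.
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
    }
}
```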
Pandepic
  • Thank you very much. Could you give some tips on which methods/functions in web browser control I'm going to need (to navigate, to load, and to query)? Thanks again. :) – user3213711 Jun 10 '14 at 19:34
  • Is there a way to use WebBrowser in a non-UI program? I need to parse a webpage that is partially generated by JavaScript, but I don't need a UI. – Spook Feb 03 '15 at 10:04
  • Hello Pandepic, Is there a way to do this in MVC? I know I can use Iframes, but many sites are not allowing cross. – Kadaj Jan 17 '17 at 16:11
  • Is there any way I can do this from a console application only? – Rakesh Yadav Jun 08 '17 at 02:28
  • It's not a good idea, because the .NET WebBrowser caches your previous actions and takes up a lot of memory on your computer, so repeated calls can cause it to crash. – MiMFa Apr 02 '20 at 07:21
4

You could take a look at a tool like Selenium for scraping pages that rely on JavaScript.

http://www.andykelk.net/tech/headless-browser-testing-with-phantomjs-selenium-webdriver-c-nunit-and-mono
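
As a rough idea of what that can look like in C#, here is a minimal sketch. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages plus a locally installed Chrome and matching chromedriver, and it uses headless Chrome rather than the PhantomJS setup from the linked article. The URL is a placeholder and the `SeleniumScrapeSketch` class name is made up.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static class SeleniumScrapeSketch
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (IWebDriver driver = new ChromeDriver(options))
        {
            // Placeholder URL; replace it with the page you want to scrape.
            driver.Navigate().GoToUrl("http://www.somewebsite.com/somepage.htm");

            // Wait up to 10 seconds until at least one div is present in the DOM.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until(d => d.FindElements(By.TagName("div")).Count > 0);

            // The elements now reflect the DOM after the page's scripts have run.
            foreach (IWebElement div in driver.FindElements(By.TagName("div")))
            {
                Console.WriteLine(div.Text);
            }
        }
    }
}
```

In practice you would wait for the specific element that the script generates rather than for any div.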

vikramsk