Lately I've been trying to scrape some data from a web page using C#. My problem is that when I load the page with the WebBrowser control, the body contains only:

<body>
    <script language="javascript"   src="com.astron.kapar.WebClient/com.astron.kapar.WebClient.nocache.js"></script>
</body>

But if you open the actual page, https://kapalk1.mavir.hu/kapar/lt-publication.jsp?locale=en_GB, in a browser, you can see that there are tables in the body, probably because the browser loads and runs the script.

My question is: how can I work with this kind of page in C#, for example to choose some dates and retrieve the resulting data? Is there a good library for this?

Sorry for bad English.

Ghoul Fool
David
  • I believe one possible explanation is that the website filters on the user agent and returns different content depending on whether you are using a browser. I don't have the `WebBrowser` API at hand, but could you try spoofing the `User-Agent` header to see what it returns? – usr-local-ΕΨΗΕΛΩΝ Jul 13 '15 at 11:58
  • Update: no, that is simply how the page is built. I opened it in Firefox, viewed the source with CTRL+U, and found the very same body. The JavaScript generates the HTML on load and is also minified (which means partially obfuscated). You may want to reverse engineer their API and make the requests yourself – usr-local-ΕΨΗΕΛΩΝ Jul 13 '15 at 12:01

2 Answers

The page is built by JavaScript, so you need a browser engine that actually executes it: either headless IE or headless WebKit.

These questions might also be relevant.

Headless browser for C# (.NET)?

c# headless browser with javascript support for crawler

Community
Nitesh Patel
If you are familiar with JavaScript, one good option for scraping a JavaScript-driven site is CasperJS.

I find CasperJS really easy to work with for scraping JavaScript-heavy sites.

  1. Write a CasperJS script that scrapes the site with CSS selectors and writes your desired output to stdout as JSON using JSON.stringify.
  2. Invoke casperjs from C# using ProcessStartInfo. Read its stdout and deserialize the JSON into a POCO.
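
Step 2 could look roughly like the sketch below. It is only an illustration under several assumptions: `casperjs` is assumed to be on the PATH (on Windows you may need the full path to the executable instead), the `Row` class and its `Date`/`Value` fields are hypothetical stand-ins for whatever your script actually prints, and `System.Text.Json` is my own choice of JSON library, not something the approach requires.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.Json;

// Hypothetical POCO: the property names must match the JSON your script emits.
public class Row
{
    public string Date { get; set; }
    public string Value { get; set; }
}

public static class CasperRunner
{
    // Runs "casperjs <scriptPath>" and returns everything the script wrote to stdout.
    public static string RunScript(string scriptPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "casperjs",                 // assumption: casperjs is on the PATH
            Arguments = "\"" + scriptPath + "\"",
            RedirectStandardOutput = true,
            UseShellExecute = false,               // required for stream redirection
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read stdout before WaitForExit so a full pipe buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("casperjs exited with code " + process.ExitCode);
            return output;
        }
    }

    // Deserializes a JSON array such as [{"date":"...","value":"..."}] into POCOs.
    public static List<Row> ParseRows(string json)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<List<Row>>(json, options);
    }
}
```

With a script saved as, say, `scrape.js` (a hypothetical file name), the two pieces chain together as `CasperRunner.ParseRows(CasperRunner.RunScript("scrape.js"))`. The script should print nothing but the JSON to stdout, otherwise deserialization will fail.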
Misterhex