
I am trying to scrape a website using C#. At some point in the process, the website returns a page containing JavaScript that I need to execute. The script generates some arguments, and the page then posts a request that carries those generated arguments as variables.

This is the JavaScript file: https://jsfiddle.net/7aw5vr59/

The result generated by the browser looks like this:

<imimxxxyyy id="ActiveX"></imimxxxyyy><form action="/home/" method="post"><input name="TS013a5875_id" value="3" type="hidden"><input name="TS013a5875_cr" value="085d52524cab2800109920a8877032c63ff20a076afde32d3949a9c0cc832e2a409e921dbd0f04b390bc9a36f79f4d080873a7f6848948001fe9d70f9af2fa1f81ba0cb687810509e2df6f37950961d59dba504d18b2e08237af58ac5683f65a8b9a4c978624319575ee9b400ae2307cbb314a0f32ecca4464cdc6b2082f7352" type="hidden"><input name="TS013a5875_76" value="085d52524cab2800109920a8877032c63ff20a076afde32d3949a9c0cc832e2a409e921dbd0f04b390bc9a36f79f4d080873a7f68488b000c2ff7c505061da44dff5459af7ebe2f604b8d36bdeeeca3eead0e146af07190233b9414ca790443d2453827dc161e073eb63ed4d10c070e405848b2ccb2dc1c4412b93dff97f978c6f1caecff07f6d4c23e1ade1bfb2f715409cf4d5f1f91a826e092193a1407539ec35c80a0d82032163abc93f6876c7c1cecded7400c11873a90a0ad58c3d18b0a55b0a0430c50575d7f535fd9b414c06b1c3b11ab326b07356737269137f2610cf26df27c7e0bcd5" type="hidden"><input name="TS013a5875_86" value="085d52524cab2800109920a8877032c63ff20a076afde32d3949a9c0cc832e2a409e921dbd0f04b390bc9a36f79f4d080873a7f68486600098382373b7447eebb69eb2b508714f7fb748b827881d272fff290b8bcf8bef6184c2a8c9f1236e71539573e709a14a158df0bb128ca0ba6e196a5b4a979b28a93e07d7089584e53a1ae51612c25ee3012964be00bc312836a58d7543f2cd825f" type="hidden"><input name="TS013a5875_md" value="1" type="hidden"><input name="TS013a5875_rf" value="0" type="hidden"><input name="TS013a5875_ct" value="0" type="hidden"><input name="TS013a5875_pd" value="0" type="hidden"></form>

As you can see, the form contains variables whose names start with TS013a5875. I need to generate the same values in my code. How can I do that?

I tried the following, with no luck. Also, the application is very tightly coupled, so adding more external dependencies is difficult.

  1. Using Jurassic Engine
  2. ScrapySharp
  3. WebBrowser Class
P J S
  • I would prefer to use an actual web browser, e.g. Chrome or Firefox, for that. And for scraping, I would use Selenium WebDriver. – Adnan Umer Nov 07 '16 at 13:39
  • How about using something like Selenium WebDriver + PhantomJS? – Hackerman Nov 07 '16 at 13:39
  • @AdnanUmer Can you give more details on Selenium WebDriver, or any references where I can understand it more clearly? – P J S Nov 07 '16 at 13:43
  • Also, how smooth would the Selenium integration into a .NET project be? I am completely new to Selenium, which means there is a lot of learning required. – P J S Nov 07 '16 at 13:44
  • @PJS http://scraping.pro/example-of-scraping-with-selenium-webdriver-in-csharp/ – Adnan Umer Nov 07 '16 at 13:47
  • @Hackerman That would be one step forward, but I'm not sure whether Selenium works with PhantomJS. – Adnan Umer Nov 07 '16 at 13:48
  • In one of my latest projects I use C#, Selenium + PhantomJS to do web scraping, and also to modify the page content by injecting JavaScript into the page, and it works like a charm. – Hackerman Nov 07 '16 at 13:52
  • @Hackerman Can you point me to any resources I can refer to for achieving this with C#, Selenium + PhantomJS? – P J S Nov 07 '16 at 13:55
  • Yeah, no problem with that :) – Hackerman Nov 07 '16 at 14:01
  • http://stackoverflow.com/questions/25417913/c-sharp-example-of-using-phantomjs-webdriver-executephantomjs-to-filter-out-imag – Hackerman Nov 07 '16 at 14:16

1 Answer

The website you are scraping probably uses an anti-scraping technology called BIG-IP, developed by F5.

You should use a browser that can execute JavaScript and that has real browser capabilities, such as rendering canvas. You can try a headless browser like PhantomJS, but it probably won't work.
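Whichever browser ends up executing the script, the generated TS013a5875_* fields still have to be read out of the rendered page and posted back. The following is only a minimal sketch of that step, assuming the rendered HTML is already available as a string and that the inputs use double-quoted attributes as in the sample from the question. The class name `ChallengeForm` and its two methods are hypothetical helpers, not part of any library, and the regular expressions are not a general HTML parser; they use only the standard library because the question rules out extra dependencies.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls name/value pairs out of already-rendered HTML.
public static class ChallengeForm
{
    // Matches each <input ...> tag; attributes are read separately so that
    // their order inside the tag does not matter.
    private static readonly Regex InputTag =
        new Regex(@"<input\b[^>]*>", RegexOptions.IgnoreCase);
    private static readonly Regex NameAttr =
        new Regex(@"\sname\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);
    private static readonly Regex ValueAttr =
        new Regex(@"\svalue\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);

    // Returns the fields in document order; inputs without a name are skipped.
    public static List<KeyValuePair<string, string>> ExtractFields(string renderedHtml)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match tag in InputTag.Matches(renderedHtml))
        {
            Match name = NameAttr.Match(tag.Value);
            if (!name.Success) continue;

            Match value = ValueAttr.Match(tag.Value);
            fields.Add(new KeyValuePair<string, string>(
                WebUtility.HtmlDecode(name.Groups[1].Value),
                value.Success ? WebUtility.HtmlDecode(value.Groups[1].Value) : string.Empty));
        }
        return fields;
    }

    // Percent-encodes the pairs into a request body such as "a=1&b=2".
    public static string ToFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts);
    }
}
```

If the request is sent with `HttpClient`, the list returned by `ExtractFields` can also be passed directly to `FormUrlEncodedContent`, which performs the encoding itself. Note that this only covers reading and re-posting the fields; producing their values still requires something that actually runs the page's script.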

barbolo