42

I am (was) a Python developer building a GUI web scraping application. Recently I have had to migrate to the .NET Framework and rewrite the same application in C# (the decision wasn't mine).

In Python, I used the Mechanize library. However, I can't seem to find anything similar in .NET. What I need is a browser that runs in headless mode and can fill out forms, submit them, and so on. JavaScript support is not a must, but it would be quite useful.

Yahia
Bo Milanovich

3 Answers

36

There are some options:

  • WebKit.Net (free)

  • Awesomium
    It is based on Chromium/WebKit and works like a charm. A free license is available as well as a commercial one, and if need be you can buy the source code :-)

  • HTML Agility Pack (free) (An HTML Parser library, NOT a headless browser)
    This helps with extracting information from HTML etc. and might be useful in your case (possibly in combination with HttpWebRequest)
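To illustrate the last option, here is a minimal sketch of using HTML Agility Pack purely as a parser to read the fields of a form. The markup, the field names, and the `FormFieldReader` helper are made up for illustration; in practice the HTML string would come from HttpWebRequest (or a similar HTTP client), and you would post the collected values back yourself, since nothing here executes JavaScript or behaves like a browser.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class FormFieldReader
{
    // Hypothetical helper: returns name -> value for every named <input> in the HTML.
    public static Dictionary<string, string> ReadInputs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection inputs = doc.DocumentNode.SelectNodes("//input[@name]");
        if (inputs == null)
        {
            return fields;
        }

        foreach (HtmlNode input in inputs)
        {
            string name = input.GetAttributeValue("name", "");
            string value = input.GetAttributeValue("value", "");
            fields[name] = value;
        }
        return fields;
    }

    public static void Main()
    {
        // Made-up markup standing in for a downloaded page.
        const string html =
            "<form action='/login' method='post'>" +
            "<input type='text' name='username' value='' />" +
            "<input type='hidden' name='token' value='abc123' />" +
            "<input type='submit' value='Log in' />" +
            "</form>";

        foreach (KeyValuePair<string, string> field in ReadInputs(html))
        {
            Console.WriteLine(field.Key + " = " + field.Value);
        }
    }
}
```

The XPath selects `//input` directly rather than going through the `form` node, because HTML Agility Pack by default does not nest child elements under `<form>`.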

ABS
Yahia
  • 2
    Thanks. Hmm, correct me if I am wrong, but don't all of these (or at least the first two) require creating a user interface (I figured that from reading the docs)? What I need is a headless browser, i.e. one without a GUI. – Bo Milanovich Apr 15 '12 at 11:25
  • 1
    @Deusdies NO - at least the second (Awesomium) and third (HTML Agility Pack) links work completely headless... with the first link I am not sure... – Yahia Apr 15 '12 at 11:26
  • @Deusdies for example Awesomium - according to the docs (see http://awesomium.com/docs/1_6_5/sharp_api/) it gives you pixels IF you want to render them in a UI; if not, there is no need to. – Yahia Apr 15 '12 at 11:29
  • @Deusdies for example HTML Agility Pack: it does not have any UI at all - you can take any string (from the web via WebRequest, a local file, or whatever) and analyze its content for forms/fields etc. – Yahia Apr 15 '12 at 11:30
  • Ah thanks. Yeah I saw them generating the UI, didn't realize it was optional. I'll wait for some more replies before marking the answer. – Bo Milanovich Apr 15 '12 at 11:45
  • 65
    For anyone else who came here via Google: HTML Agility Pack is not a headless browser, it's simply an HTML parser to be used in conjunction with a web client. A headless browser does a lot more than that. – nagytech Mar 07 '14 at 22:03
  • There must be some other options? Python had Selenium with PhantomJS. – User Aug 30 '14 at 21:34
  • One additional option ... I was facing a similar problem and wrote a wrapper for the .NET WebBrowser. For anyone else who's interested in a fairly simple headless browser for .NET, I posted the code to GitHub and made it available via NuGet; for more information see https://github.com/LeastOne/WebBrowserWaiter – LeastOne Nov 01 '14 at 06:41
  • 4
    Since this question was answered, Awesomium looks dead. http://answers.awesomium.com/questions/6880/does-the-project-is-still-supported-developed.html – Troy Witthoeft Oct 24 '16 at 21:51
  • If it is simply for web scraping, then you can use ChromeDriver in headless mode by passing --headless as an argument to ChromeOptions – Naveen Dennis May 09 '18 at 05:47
14

More solutions:

  • PhantomJS - a full-featured headless web browser. It is often paired with Selenium, which lets you drive the browser from a .NET application.
  • Optimus (NuGet package) - a lightweight headless web browser. It's in beta, but it is sufficient for some cases.

I used to use both for web testing, but they are also suitable for web scraping.
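As a rough sketch of the Selenium route: the example below swaps in Chrome's headless mode (via ChromeDriver and the `--headless` argument) instead of PhantomJS, which is my substitution and not necessarily what was used originally. The URL, the field names, and the `HeadlessFormDemo` class are hypothetical, and it assumes the Selenium.WebDriver NuGet package plus a chromedriver binary matching the installed Chrome.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome; // NuGet package: Selenium.WebDriver

public static class HeadlessFormDemo
{
    // Builds Chrome options that suppress the browser window.
    public static ChromeOptions BuildOptions()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        return options;
    }

    public static void Main()
    {
        using (var driver = new ChromeDriver(BuildOptions()))
        {
            // Hypothetical page and field names; replace with the real ones.
            driver.Navigate().GoToUrl("https://example.com/login");

            driver.FindElement(By.Name("username")).SendKeys("alice");
            IWebElement password = driver.FindElement(By.Name("password"));
            password.SendKeys("secret");

            // Submit() submits the form that contains this element.
            password.Submit();

            // Run JavaScript in the page and read back its return value.
            var js = (IJavaScriptExecutor)driver;
            object title = js.ExecuteScript("return document.title;");
            Console.WriteLine(title);
        }
    }
}
```

Because a real browser engine is doing the work, page scripts run as usual, which is the main difference from a plain HTTP client plus parser.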

0xced
Knyaz
  • 1
    A link to a potential solution is always welcome, but please add context around the link so your fellow users will have some idea what it is and why it’s there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline. Take into account that being barely more than a link to an external site is a possible reason as to [Why and how are some answers deleted?](http://stackoverflow.com/help/deleted-answers) – Vladimir Vagaytsev Jul 19 '16 at 11:51
  • Thank you guys. I've updated my answer. – Knyaz Jul 20 '16 at 15:33
  • 1
    Excessive promotion of a specific product/resource may be perceived by the community as **spam**. Take a look at the [help], specially [What kind of behavior is expected of users?](//stackoverflow.com/help/behavior)'s last section: _Avoid overt self-promotion_. You might also be interested in [How do I advertise on Stack Overflow?](//stackoverflow.com/help/advertising). – Blue Jul 20 '16 at 16:56
  • @Knyaz - do you have a working example with Selenium? Say, one where some JavaScript is run and its return value is retrieved. – FrenkyB Aug 30 '17 at 11:23
  • @Knyaz, Optimus support required. Please check optimus.net@yandex.ru. – AsValeO Apr 22 '18 at 15:56
  • Hi @Knyaz, will Optimus work with Xamarin as well? I need a headless browser that can stay alive when the main thread is GC'd. – Mutu A. Jun 19 '20 at 06:02
  • I tried Optimus, but it didn't load the JavaScript. Is there anything I should do to make it run JavaScript? – khalil Mar 02 '22 at 07:03
5

You may be after TrifleJS (currently in beta), or something similar built on the .NET WebBrowser class, which communicates with IE via a windowless ActiveX/COM API.

You'll essentially be running a fully fledged browser (not an HTTP request wrapper) using Internet Explorer's Trident engine. If you are not interested in the JavaScript API (a port of PhantomJS), you may still be able to use some of the C# codebase to get around key concepts (custom headers, cookies, script execution, screenshot rendering, etc.).

Note that this can also emulate different versions of IE depending on what you have installed.


Steven de Salas