I'm investigating the possibility of porting the Python library Beautiful Soup over to .NET, mainly because I really love the parser and there are simply no good HTML parsers for the .NET Framework (Html Agility Pack is outdated, buggy, undocumented and doesn't work well unless the exact schema is known).

One of my primary goals is to get the basic DOM selection functionality to really parallel the beauty and simplicity of BeautifulSoup, allowing developers to easily craft expressions to find elements they're looking for.

BeautifulSoup takes advantage of loose-binding and named parameters to make this happen. For example, to find all a tags with an id of test and a title that contains the word foo, I could do:

soup.find_all('a', id='test', title=re.compile('foo'))

However, C# has no concept of an arbitrary number of named arguments. C# 4.0 has named arguments, but they have to match the parameters of an existing method signature.

My Question: What is the C# design pattern that most parallels this Pythonic construct?

Some Ideas:

I'd like to approach this based on how I, as a developer, would like to code. Implementing this is outside the scope of this post. One idea I had would be to use anonymous types. Something like:

soup.FindAll("a", new { Id = "Test", Title = new Regex("foo") });

Though this syntax loosely matches the Python implementation, it still has some disadvantages.

  1. The FindAll implementation would have to use reflection to parse the anonymous type, and handle any arbitrary metadata in a reasonable manner.
  2. The FindAll prototype would need to take an Object, which makes it fairly unclear how to use the method unless you're well familiar with the documented behavior. I don't believe there's a way to declare a method that must take an anonymous type.
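
To make the first disadvantage concrete, here is a minimal sketch of what the reflection step might look like. The `Node` and `Soup` types are hypothetical stand-ins invented for illustration, not part of any existing library, and the matching rules (case-insensitive attribute names, a `Regex` value meaning "pattern match", anything else compared as a string) are assumptions rather than BeautifulSoup's exact semantics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical node type, for illustration only.
public class Node
{
    public string Tag { get; set; }

    // Case-insensitive keys so that a property named "Id" finds the "id" attribute.
    public Dictionary<string, string> Attributes { get; } =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
}

// Hypothetical container holding a flat list of nodes.
public class Soup
{
    private readonly List<Node> _nodes;

    public Soup(IEnumerable<Node> nodes) { _nodes = nodes.ToList(); }

    // 'filter' is expected to be an anonymous type such as new { Id = "Test" },
    // but the compiler cannot enforce that: any object is accepted.
    public IEnumerable<Node> FindAll(string tag, object filter)
    {
        // Read every public property of the filter object once, via reflection.
        var criteria = filter == null
            ? new List<KeyValuePair<string, object>>()
            : filter.GetType().GetProperties()
                .Select(p => new KeyValuePair<string, object>(p.Name, p.GetValue(filter, null)))
                .ToList();

        return _nodes.Where(n =>
            string.Equals(n.Tag, tag, StringComparison.OrdinalIgnoreCase) &&
            criteria.All(c => Matches(n, c.Key, c.Value)));
    }

    private static bool Matches(Node node, string name, object expected)
    {
        if (!node.Attributes.TryGetValue(name, out var actual)) return false;

        // A Regex value means "attribute matches the pattern";
        // anything else is converted to a string and compared exactly.
        if (expected is Regex regex) return regex.IsMatch(actual);
        return string.Equals(actual, Convert.ToString(expected), StringComparison.Ordinal);
    }
}
```

With this sketch, `soup.FindAll("a", new { Id = "Test", Title = new Regex("foo") })` compiles and filters as expected, but so does `soup.FindAll("a", 42)`, which illustrates the second disadvantage.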

Another idea I had is perhaps a more .NET way of handling this but strays further away from the library's Python roots. That would be to use a fluent pattern. Something like:

soup.FindAll("a")
    .Attr("id", "Test")
    .Attr("title", new Regex("foo"));

This would require building an expression tree and locating the appropriate nodes in the DOM.
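
A full expression tree may not be strictly necessary for a first version. The following sketch swaps in a simpler technique, a query object that accumulates plain predicates and applies them lazily when enumerated. All type names here (`Element`, `ElementQuery`, `SoupDocument`) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical element type, for illustration only.
public class Element
{
    public string Tag { get; set; }
    public Dictionary<string, string> Attributes { get; } =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
}

// Accumulates predicates; nothing is evaluated until the query is enumerated.
public class ElementQuery : IEnumerable<Element>
{
    private readonly IEnumerable<Element> _source;
    private readonly List<Func<Element, bool>> _predicates;

    public ElementQuery(IEnumerable<Element> source)
        : this(source, new List<Func<Element, bool>>()) { }

    private ElementQuery(IEnumerable<Element> source, List<Func<Element, bool>> predicates)
    {
        _source = source;
        _predicates = predicates;
    }

    // Each call returns a new query, so a partially built query can be reused.
    public ElementQuery Where(Func<Element, bool> predicate)
    {
        var next = new List<Func<Element, bool>>(_predicates) { predicate };
        return new ElementQuery(_source, next);
    }

    // Exact, case-sensitive match on the attribute value.
    public ElementQuery Attr(string name, string value) =>
        Where(e => e.Attributes.TryGetValue(name, out var actual) && actual == value);

    // Pattern match on the attribute value.
    public ElementQuery Attr(string name, Regex pattern) =>
        Where(e => e.Attributes.TryGetValue(name, out var actual) && pattern.IsMatch(actual));

    public IEnumerator<Element> GetEnumerator() =>
        _source.Where(e => _predicates.All(p => p(e))).GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

// Hypothetical document holding a flat list of elements.
public class SoupDocument
{
    private readonly List<Element> _elements;

    public SoupDocument(IEnumerable<Element> elements) { _elements = elements.ToList(); }

    public ElementQuery FindAll(string tag) =>
        new ElementQuery(_elements)
            .Where(e => string.Equals(e.Tag, tag, StringComparison.OrdinalIgnoreCase));
}
```

Because `ElementQuery` implements `IEnumerable<Element>`, the result can be used in a `foreach` loop or passed to other LINQ operators. The overloads of `Attr` also give the compiler something to check, which the anonymous-type approach cannot.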

The third and last idea I have would be to use LINQ. Something like:

var nodes = (from n in soup
             where n.Tag == "a" &&
             n["id"] == "Test" &&
             Regex.Match(n["title"], "foo").Success
             select n);

I'd appreciate any insight from anyone with experience porting Python code to C#, or just overall recommendations on the best way to handle this situation.

Mike Christensen
  • As much as I love Python - always aim at the audience that will use it. If you are writing it for .NET, do it in the style that they use. Look at existing .NET libraries and see what the practices are (or wait for someone to tell you here) and use those - don't try and match the Python version, you are not using Python. – Gareth Latty May 03 '12 at 15:59
  • I agree with Lattyware. If you want to use BeautifulSoup from C#, couldn't you just run it through IronPython? – mata May 03 '12 at 16:03
  • Isn't this what XPath is for? – Ignacio Vazquez-Abrams May 03 '12 at 16:04
  • @Lattyware - Yea, part of me just says *screw it* and wants to simply write the best DOM parser for .NET. It would be largely *inspired* by BeautifulSoup, but in no way a port. I've also considered writing this library as a set of extension methods to AgilityPack, which is perhaps the way to go too. – Mike Christensen May 03 '12 at 16:07
  • @mata - First, I wonder how tough it would be to get BeautifulSoup to compile under IronPython. Second, I wonder how the interop would work between C# code and Python code. I don't know enough about IronPython to be able to answer that. – Mike Christensen May 03 '12 at 16:08
  • @IgnacioVazquez-Abrams - AgilityPack is all XPath based, and I find it way too limited. Very simple expressions work well, but if you want to really programmatically define what you want, you'll run into trouble quickly. It could also be the fact that AgilityPack's XPath parser is very limited and doesn't support things like *contains* or case-insensitive matches. For example, I wasn't able to find nodes that contain the class *foo* unless *foo* was the only class on the node. – Mike Christensen May 03 '12 at 16:12
  • @Mike - Well, because it consists only of pure Python code, it should run fine; see also [this](http://stackoverflow.com/questions/118654/iron-python-beautiful-soup-win32-app). But it was only a suggestion. – mata May 03 '12 at 16:20
  • @mata - Interesting thread. The accepted answer seems to be "Use AgilityPack!" and the topic of interop between Python and C# code was not addressed. – Mike Christensen May 03 '12 at 16:29
  • @MikeChristensen - The question was whether it runs in IronPython, and that was addressed in the [second answer](http://stackoverflow.com/a/6549240/1350899). How to execute Python code from C# is a [different](http://stackoverflow.com/questions/3002402/calling-python-app-script-from-c-sharp) [matter](http://blogs.msdn.com/b/charlie/archive/2009/10/25/hosting-ironpython-in-a-c-4-0-program.aspx). But if you want to write your own parser, that's good too. – mata May 03 '12 at 17:04
  • @mata - Yea, my goal is to create an open-source library that competes with AgilityPack and is inspired by BeautifulSoup. Thus, I want it to be purely managed/C# code. I'd either write my own parser, or use something else open-source if the license allows. – Mike Christensen May 03 '12 at 17:23

1 Answer

Have you tried running your code inside the IronPython engine? As far as I know, it performs really well, and you don't have to touch your Python code.

Ale Miralles
  • This is a great idea, however I'd like to see an example of what it would look like in C# to call into a Python-implemented method with named parameters. Does IronPython provide an interop story for this scenario? Also, doing this basically steers this question over to "How do I use a Python library in .NET?" which is not really what I was asking. – Mike Christensen May 03 '12 at 16:15