I'm investigating the possibility of porting the Python library Beautiful Soup over to .NET, mainly because I really love the parser and there are simply no good HTML parsers for the .NET Framework (Html Agility Pack is outdated, buggy, undocumented, and doesn't work well unless the exact schema is known).
One of my primary goals is to get the basic DOM selection functionality to really parallel the beauty and simplicity of BeautifulSoup, allowing developers to easily craft expressions to find elements they're looking for.
BeautifulSoup takes advantage of loose binding and named parameters to make this happen. For example, to find all a tags with an id of test and a title that contains the word foo, I could do:
soup.find_all('a', id='test', title=re.compile('foo'))
However, C# doesn't have a concept of an arbitrary number of named elements. C# 4.0 (which shipped with .NET 4) has named arguments, but they have to match parameters declared in an existing method prototype.
My Question: What is the C# design pattern that most parallels this Pythonic construct?
Some Ideas:
I'd like to go after this based on how I, as a developer, would like to code. Implementing this is out of the scope of this post. One idea I had would be to use anonymous types. Something like:
soup.FindAll("a", new { Id = "Test", Title = new Regex("foo") });
Though this syntax loosely matches the Python implementation, it still has some disadvantages.
- The FindAll implementation would have to use reflection to parse the anonymous type, and handle any arbitrary metadata in a reasonable manner.
- The FindAll prototype would need to take an Object, which makes it fairly unclear how to use the method unless you're well familiar with the documented behavior. I don't believe there's a way to declare a method that must take an anonymous type.
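To make the reflection concern concrete, here is a rough sketch of what that step might look like. AttributeFilters and FromAnonymous are hypothetical names I made up for illustration, and matching attribute names case-insensitively is just an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any existing library.
public static class AttributeFilters
{
    // Turns new { Id = "Test", Title = new Regex("foo") } into a map from
    // attribute name to a predicate over that attribute's value.
    public static Dictionary<string, Func<string, bool>> FromAnonymous(object filters)
    {
        // Assumption: attribute names are compared case-insensitively.
        var result = new Dictionary<string, Func<string, bool>>(StringComparer.OrdinalIgnoreCase);
        if (filters == null) return result;

        // Anonymous types expose their members as public read-only properties.
        foreach (var prop in filters.GetType().GetProperties())
        {
            object expected = prop.GetValue(filters, null);
            var regex = expected as Regex;
            if (regex != null)
            {
                // Regex values match when the pattern is found in the attribute.
                result[prop.Name] = actual => actual != null && regex.IsMatch(actual);
            }
            else
            {
                // Anything else is compared by its string form.
                string text = Convert.ToString(expected);
                result[prop.Name] = actual => string.Equals(actual, text, StringComparison.Ordinal);
            }
        }
        return result;
    }
}
```

FindAll could then call AttributeFilters.FromAnonymous(filters) once and run each predicate against the matching attribute of every candidate node. It still doesn't solve the second problem: the parameter is typed as object, so the compiler can't stop someone from passing a string or an int.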
Another idea I had is perhaps a more .NET way of handling this, but it strays further from the library's Python roots. That would be to use a fluent pattern. Something like:
soup.FindAll("a")
    .Attr("id", "Test")
    .Attr("title", new Regex("foo"));
This would require building an expression tree and locating the appropriate nodes in the DOM.
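As a minimal sketch of how the fluent version might hang together, the following uses a plain list of delegates rather than a real expression tree, which is simpler but less inspectable. Node and NodeQuery are hypothetical types invented for this example, and the flat Attributes dictionary is an assumption about how a node would store its attributes:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a parsed element; a real port would have a richer node type.
public class Node
{
    public string Tag;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

// Hypothetical query object: each Attr call returns a new query with one more
// predicate, and nothing is evaluated until the query is enumerated.
public class NodeQuery : IEnumerable<Node>
{
    private readonly IEnumerable<Node> source;
    private readonly List<Func<Node, bool>> predicates;

    public NodeQuery(IEnumerable<Node> source, string tag)
    {
        this.source = source;
        this.predicates = new List<Func<Node, bool>> { n => n.Tag == tag };
    }

    private NodeQuery(IEnumerable<Node> source, List<Func<Node, bool>> predicates)
    {
        this.source = source;
        this.predicates = predicates;
    }

    // Exact match on an attribute value.
    public NodeQuery Attr(string name, string value)
    {
        return With(n => Get(n, name) == value);
    }

    // Pattern match on an attribute value; a missing attribute never matches.
    public NodeQuery Attr(string name, Regex pattern)
    {
        return With(n => { var v = Get(n, name); return v != null && pattern.IsMatch(v); });
    }

    private NodeQuery With(Func<Node, bool> predicate)
    {
        // Copy the list so earlier queries stay unchanged and reusable.
        return new NodeQuery(source, new List<Func<Node, bool>>(predicates) { predicate });
    }

    private static string Get(Node n, string name)
    {
        string v;
        return n.Attributes.TryGetValue(name, out v) ? v : null;
    }

    public IEnumerator<Node> GetEnumerator()
    {
        // A node is returned only if every accumulated predicate accepts it.
        return source.Where(n => predicates.All(p => p(n))).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

With this shape, soup.FindAll("a") would simply return new NodeQuery(allNodes, "a"), where allNodes is whatever flat sequence of nodes the parser produces. Because NodeQuery implements IEnumerable<Node>, the result can be used in a foreach loop or combined with LINQ operators directly.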
The third and last idea I have would be to use LINQ. Something like:
var nodes = (from n in soup
             where n.Tag == "a" &&
                   n["id"] == "Test" &&
                   Regex.Match(n["title"], "foo").Success
             select n);
I'd appreciate any insight from anyone with experience porting Python code to C#, or just overall recommendations on the best way to handle this situation.