
I am parsing a web page on Windows Phone 7 and need to know the better way to do this. Performance is the most important factor. I saw an example with IMDb where the author uses regex, but I am not sure whether it would be better to use Html Agility Pack and LINQ.

P.S.: I have to parse the website, and it is not my own website.

Marc
Libor Zapletal
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – John Saunders Oct 10 '11 at 19:51

2 Answers


You'll be best served by using the Html Agility Pack and LINQ.

Parsing HTML with regex is quite unreliable.
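To illustrate the reliability point, here is a minimal sketch in Python rather than C#. Python's standard-library `html.parser` stands in for a real HTML parser such as Html Agility Pack, and the helper names (`strip_tags_regex`, `strip_tags_parser`) are hypothetical. It shows a common tag-stripping regex breaking on a `>` inside a quoted attribute, while a parser handles the same input.

```python
import re
from html.parser import HTMLParser

# A typical "strip all tags" pattern: anything between < and the next >.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(html):
    """Remove tags with a naive regex (hypothetical helper)."""
    return TAG_RE.sub("", html)


class _TextCollector(HTMLParser):
    """Collects only the text nodes the parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_parser(html):
    """Remove tags by actually parsing the markup (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


# The ">" inside the quoted attribute ends the regex match too early.
sample = '<a title="a > b" href="#">link</a>'
print(repr(strip_tags_regex(sample)))
print(repr(strip_tags_parser(sample)))
```

The regex version leaks part of the attribute into the output, whereas the parser version returns only the link text. A more elaborate regex can cover this case, but each such fix tends to expose another one.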

Community
Justin Niessner

By chance I am working on a similar problem. I can't make any authoritative statement yet, as it is too early. To start with I took 3 engines: Majestic, Html Agility Pack ("Agility" below), and a Regex-based converter.

Of course, there are a lot more options (I even once wrote a simple HTML viewer for Palm OS), but this seemed to be a good start.

Majestic did not offer HTML-to-text conversion, just sample code showing how to walk over the HTML string. To start with I implemented a trivial conversion:

  • Write all text nodes
  • Convert <p> to "\n\n" and <br> to "\n"
  • Ignore everything else
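The three rules above can be sketched as follows. This is only an illustration in Python using the standard-library `html.parser`, not the Majestic-based code described in this answer; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TrivialTextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, maps <p> and <br> to newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.parts.append("\n\n")   # paragraph break
        elif tag == "br":
            self.parts.append("\n")     # line break
        # every other tag is ignored

    def handle_data(self, data):
        # write all text nodes unchanged
        self.parts.append(data)


def html_to_text(html):
    parser = TrivialTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(repr(html_to_text("<h1>Title</h1><p>First<br>second</p><b>bold</b>")))
```

Because every tag other than `<p>` and `<br>` is dropped without a replacement, headings, lists and tables run together with the surrounding text. That matches the kind of failures listed further down.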

Then I collected a sample of 50+ HTML files and converted them using all 3 methods. I have to say that I wasn't happy with any of the methods. Two general observations:

  • Results from Majestic and Agility were remarkably similar.
  • The Regex method was an order of magnitude slower.

So I looked into the Regex code and found a pointless loop at the bottom. After an easy optimization the Regex method was only ~25% slower. Given that it makes more than 30 complex Regex replacements, I considered this a good result.
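For readers unfamiliar with the approach, a Regex-based converter of this kind is essentially an ordered table of replacements. The sketch below is in Python and is not the code discussed above: the rule table is a hypothetical five-rule stand-in for the 30+ rules mentioned, and nothing here reproduces the loop that was removed. It only shows the general shape, with each pattern compiled once and applied in a single pass.

```python
import re
from html import unescape

# Hypothetical rule table; order matters, since later rules see earlier output.
_RULES = [
    # drop script/style blocks together with their content
    (re.compile(r"(?is)<(script|style)\b.*?</\1\s*>"), ""),
    # <br> and <br/> become a single newline
    (re.compile(r"(?i)<br\s*/?>"), "\n"),
    # an opening <p> starts a new paragraph
    (re.compile(r"(?i)<p\b[^>]*>"), "\n\n"),
    # remove every remaining tag
    (re.compile(r"<[^>]+>"), ""),
    # collapse runs of blank lines
    (re.compile(r"\n{3,}"), "\n\n"),
]


def regex_html_to_text(html):
    text = html
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    # decode entities such as &amp; last, so decoded "<" is not treated as a tag
    return unescape(text).strip()


print(repr(regex_html_to_text(
    "<p>One<br/>two</p><script>x()</script><p>Three &amp; four</p>"
)))
```

Each rule costs one full scan of the text, so the running time grows with the number of rules. Compiling the patterns once, outside the function, avoids repeating that work on every call.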

Then I wrote a test HTML file containing all common HTML tags and a bit more. As before, Majestic and Agility performed similarly.

  • All engines ok: h1, p, tags written as text
  • All engines failed: h2+, hr, b
  • br: Regex failed, Majestic ok
  • Lists: Regex ok, Majestic failed
  • Simple 2x2 table: Regex ok, Majestic failed

There's a lot more to test, for example encoding.

At this moment I would only say that Regex seems to be the better alternative. However, none of the mentioned engines performs satisfactorily. On a positive note, tweaking these engines (particularly Majestic and Regex) is easy. Maybe the same holds true for Agility as well; however, I did not look into the package deeply enough to say.

Jan Slodicka
  • A follow-up: I decided on Regex. The main argument is that it is just a small piece of code that's easily added to another project. A word of warning about the original code in the above reference: it's slow and buggy, i.e. you need to improve the regular expressions it uses. – Jan Slodicka Oct 14 '11 at 14:09