What's better for parsing: regular expressions or LINQ?

Question

I am parsing a web page in a Windows Phone 7 app and I need to know which is the better way to do it. Performance matters most. In an IMDb example I saw, the author uses regex, but I am not sure whether it wouldn't be better to use the Html Agility Pack and LINQ.

P.S.: I have to parse a website that is not mine.

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — John Saunders, Oct 10 '11 at 19:51

score 7 · Accepted Answer · edited May 23 '17 at 12:29

You'll be best served using the Html Agility Pack and Linq.

Parsing HTML with RegEx is quite unreliable.
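
A small illustration of why, sketched in Python with only the standard library rather than with the Html Agility Pack itself. The sample markup and the names `SAMPLE`, `NAIVE_HREF`, `HrefCollector`, `hrefs_with_regex` and `hrefs_with_parser` are made up for this example. A pattern written for one attribute layout silently misses equivalent markup, while a real parser does not care about quoting style, attribute order or tag case:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: three equivalent links written in different styles.
SAMPLE = (
    '<a href="/one">1</a>'
    "<a class='x' href='/two'>2</a>"
    '<A HREF="/three">3</A>'
)

# A naive pattern that assumes a lowercase tag, href as the first
# attribute, and double quotes.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')


class HrefCollector(HTMLParser):
    """Collects the href value of every anchor tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def hrefs_with_regex(html):
    return NAIVE_HREF.findall(html)


def hrefs_with_parser(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

On this sample the regex finds only the first link, while the parser finds all three. The regex can of course be extended to cover each case, but every page you don't control can bring a new one.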


answered Oct 10 '11 at 19:48

Justin Niessner


score 0 · Answer 2 · answered Oct 11 '11 at 11:46

By chance I am working on a similar subject. I can't make any authoritative statement yet, as it is too early. To start with, I took three engines:

AgilityPack
Majestic
Regex

Of course, there are many more options (I even once wrote a simple HTML viewer for Palm OS), but this seemed to be a good start.

Majestic did not offer HTML-to-text conversion, just sample code showing how to walk over the HTML string. To start with, I implemented a trivial conversion:

Write all text nodes
Convert <p> to "\n\n" and <br> to "\n"
Ignore everything else
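
The three rules above can be sketched as follows. This is a Python stand-in built on the standard library's `html.parser`, not the Majestic-based code described in this answer; the names `TrivialTextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TrivialTextExtractor(HTMLParser):
    """Applies the three rules: keep text, map <p> and <br>, ignore the rest."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.parts.append("\n\n")
        elif tag == "br":
            self.parts.append("\n")
        # Every other tag is ignored.

    def handle_data(self, data):
        # Write all text nodes as they come.
        self.parts.append(data)


def html_to_text(html):
    extractor = TrivialTextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.parts)
```

For example, `html_to_text("<p>First<br>second</p>")` gives a blank line followed by the two lines of text. Self-closing `<br/>` goes through `handle_starttag` as well, so it needs no extra case.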

Then I collected a sample of 50+ HTML files and converted them using all three methods. I have to say that I wasn't happy with any of them. Two general observations:

Results from Majestic and Agility were remarkably similar
Regex method was an order of magnitude slower.

So I looked into the Regex code and found a pointless loop at the bottom. After an easy optimization the Regex method was only ~25% slower. Given that it makes more than 30 complex Regex replacements, I considered this a good result.
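
The Regex code itself is not shown here, so the following is only an illustration of the general shape of a replacement-based converter, written in Python. The rules in `RULES` and the name `regex_html_to_text` are hypothetical, and compiling the patterns once is a common habit, not necessarily the optimization mentioned above.

```python
import re

# Hypothetical rules, compiled once at import time and applied in order.
RULES = [
    # Drop script and style elements together with their content.
    (re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S), ""),
    # <br>, <br/> and <br /> become a single newline.
    (re.compile(r"<br\s*/?>", re.I), "\n"),
    # An opening <p>, with or without attributes, becomes a blank line.
    (re.compile(r"<p\b[^>]*>", re.I), "\n\n"),
    # Any remaining tag is removed.
    (re.compile(r"<[^>]+>"), ""),
]


def regex_html_to_text(html):
    text = html
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

Each rule is one full pass over the text, so a chain of 30 or more rules means 30 or more passes. That gives a rough idea of where the time goes compared with a single parser walk.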

Then I wrote a test html file containing all common html tags and a bit more. As before, Majestic and Agility performed similarly.

All engines ok: h1, p, tags written as text
All engines failed: h2+, hr, b
br: Regex failed, Majestic ok
Lists: Regex ok, Majestic failed
Simple 2x2 table: Regex ok, Majestic failed

There's a lot more to test, for example encoding.

At this moment I would only say that Regex seems to be the better alternative. However, none of the mentioned engines performs satisfactorily. On a positive note, tweaking these engines (particularly Majestic and Regex) is easy. Maybe the same holds true for Agility as well; however, I did not look into the package deeply enough to say that.

A follow-up: I decided on Regex. The main argument is that it is just a small piece of code that's easily added to another project. One word of warning about the original code in the reference above: it's slow and buggy, i.e. you need to improve the regular expressions it uses. — Jan Slodicka, Oct 14 '11 at 14:09
