Which solution is faster for extracting content in a web crawler?

Question

I have built a web crawler using ASP.NET, and it works well. The problem comes when I want to extract content from the crawled pages, because the content I need is wrapped in HTML tags. I have a few candidate solutions for extracting it, but I don't know which one is better. It should perform well and be easy to implement.

Using Regex with many patterns to extract content.
Using Linq to XML to extract content.
Using XPath to extract content.

Could somebody please help me choose the best solution? I think I will go with XPath, but I am not sure whether its performance is better than Regex or Linq to XML.

Many thanks for any ideas.

score 4 · Accepted Answer · edited Nov 28 '17 at 19:45

None of your solutions is particularly good.

HTML is not a regular language, so it is not a good fit for regular expressions. See also the standard response to parsing HTML with regex.
HTML is not necessarily valid XML, so XML tools such as Linq to XML can fail on real-world pages.

Instead, you should use an HTML parsing library like the Html Agility Pack.
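To make the "HTML is not necessarily valid XML" point concrete, here is a minimal sketch. It uses Python's standard library purely for illustration (the library recommended in this thread for .NET is the Html Agility Pack, and its API is not shown here). The sample page and the helper names `parses_as_xml`, `TextCollector` and `extract_text` are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world HTML: <br> is never closed, which is fine in HTML but not in XML.
PAGE = "<html><body><p>Hello<br>world</p></body></html>"


def parses_as_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(markup):
    """Run the tolerant HTML parser over the markup and return its text nodes."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


print(parses_as_xml(PAGE))   # the strict XML parser rejects the unclosed <br>
print(extract_text(PAGE))    # the HTML parser still yields the text nodes
```

The same trade-off applies in .NET: an XML-based API needs well-formed input, while a dedicated HTML parser is built to cope with markup as it appears in the wild.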

edited Nov 28 '17 at 19:45 by carla

answered May 02 '13 at 14:10 by Daniel Hilgarth

score 3 · Answer 2 · answered May 02 '13 at 14:09

Neither. Use a proper HTML parser such as HTML Agility Pack.

answered May 02 '13 at 14:09 by Darko Kenda

score 3 · Answer 3 · answered May 02 '13 at 14:11

Regex is no doubt faster than both the Linq to XML and XPath approaches. But you cannot parse everything out of HTML markup using Regex; HTML is too complex for that.
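As a rough illustration of where a regex runs out of steam, consider nested tags. The sketch below is in Python with only the standard library, chosen for brevity rather than because the thread uses it. The snippet, the pattern and the names `regex_div_content`, `OuterDivText` and `parser_div_text` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Nested markup of the kind a crawler meets all the time.
SNIPPET = "<div>outer <div>inner</div> tail</div>"

# A typical "grab whatever sits between the tags" pattern.
DIV_PATTERN = re.compile(r"<div>(.*?)</div>", re.DOTALL)


def regex_div_content(markup):
    """Return what the regex believes is the content of the first <div>."""
    match = DIV_PATTERN.search(markup)
    return match.group(1) if match else ""


class OuterDivText(HTMLParser):
    """Collects all text inside the first top-level <div> by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # the first top-level <div> has been closed

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def parser_div_text(markup):
    parser = OuterDivText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(repr(regex_div_content(SNIPPET)))  # 'outer <div>inner' (cut off at the inner </div>)
print(repr(parser_div_text(SNIPPET)))    # 'outer inner tail'
```

The non-greedy pattern stops at the first closing tag, and a greedy one would instead run past the end when two sibling blocks follow each other, so each new page layout tends to need yet another pattern. A parser that tracks nesting does not have that problem.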

I didn't design my own crawler, though. I used arachnode.net, which crawls massive amounts of data, and everywhere I used Html Agility Pack to extract various components, e.g. HTML controls, cookies, meta tags, etc.

score 3 · Answer 4 · answered May 02 '13 at 14:14

As the others have already hinted, use a proper HTML parser. In most cases, HTML is not written well enough to be treated as XML. What's worse, HTML5 permits syntax that an XML parser cannot parse at all. For example, HTML5 allows you to omit the quotes around attribute values.
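A small sketch of the unquoted-attribute case, again in Python's standard library only as an illustration (it is not the API of any of the .NET parsers mentioned in this thread). The sample markup and the names `LinkCollector` and `extract_links` are made up for the example.

```python
from html.parser import HTMLParser

# One link with an unquoted href and unclosed <li> elements: legal HTML5, not XML.
PAGE = '<ul><li><a href=/about>About</a><li><a href="/contact">Contact</a></ul>'


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, quoted or not."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


print(extract_links(PAGE))  # ['/about', '/contact']
```

An XML parser would stop at `href=/about`, whereas an HTML parser treats both forms of the attribute the same way.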

Along with HTML Agility Pack, you can take a look at Majestic-12's HTML parser (Majestic-12 : Projects : C# HTML parser (.NET)).

answered May 02 '13 at 14:14 by Toni Petrina

Thanks. I will look at it as another approach besides Html Agility Pack. – Tim Phan May 02 '13 at 14:16
