2

I have built a web crawler using ASP.NET, and it works well. The problem comes when I want to extract content from the pages it fetches. Some of the content is wrapped in HTML tags. I have a few candidate solutions for extracting it, but I don't know which one is better. It should perform well and be easy to implement.

  1. Using regex with many patterns to extract content.

  2. Using LINQ to XML to extract content.

  3. Using XPath to extract content.

Could somebody please help me choose the best solution? I think I will go with XPath, but I am not sure whether its performance is better than regex or LINQ to XML.

Many thanks for any ideas.

Tim Phan

4 Answers

4

None of your solutions is particularly good.

  1. HTML is not a regular language and as such is not a good fit for regular expressions. See also the standard response to parsing HTML with regex.
  2. HTML is not necessarily valid XML, so LINQ to XML and XPath over an XML parser will fail on many real pages.

Instead, you should use an HTML parsing library like the Html Agility Pack.
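To make the "not necessarily valid XML" point concrete, here is a minimal sketch. It is written in Python with only the standard library so that it runs anywhere; `html.parser` is just a stand-in for a tolerant HTML parser such as the Html Agility Pack, and `SAMPLE`, `parses_as_xml`, `TextCollector` and `extract_text` are names made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: ordinary HTML, but <br> is never closed,
# so the markup is not well-formed XML.
SAMPLE = "<p>Hello<br>world</p>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collects text nodes; the HTML parser tolerates unclosed tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the text nodes found by the tolerant HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    print(parses_as_xml(SAMPLE))  # the strict XML parser rejects it
    print(extract_text(SAMPLE))   # the HTML parser still yields the text
```

The same contrast should hold in .NET: an XML-based API rejects the unclosed `<br>`, while a dedicated HTML parser is built to cope with it.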

carla
Daniel Hilgarth
3

Neither. Use a proper HTML parser such as the HTML Agility Pack.

Darko Kenda
3

Regex is no doubt faster than both the LINQ to XML and the XPath approach, but you cannot parse everything out of HTML markup with regex. HTML is too complex for that.

I didn't design my own crawler, though; I used arachnode.net, which crawls massive amounts of data. Everywhere in it I've used the Html Agility Pack to extract various components, e.g. HTML controls, cookies, meta tags, etc.

Manish Mishra
3

As the others have already hinted, use a proper HTML parser. In most cases, HTML is not written well enough to be treated as XML. What's worse, HTML5 permits syntax that cannot be parsed as XML at all. For example, HTML5 allows you to omit the quotes around attribute values.
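Here is a small sketch of why unquoted attributes hurt the regex approach as well. Again it is Python with the standard library only, as a stand-in for the .NET tools discussed here; the sample markup, the naive pattern and the helper names are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the second link leaves its attribute value unquoted.
SAMPLE = '<a href="first.html">one</a> <a href=second.html>two</a>'

# Naive pattern that assumes every href value is double-quoted.
QUOTED_HREF = re.compile(r'href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, quoted or not."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_regex(markup):
    """Return hrefs found by the quoted-only regex."""
    return QUOTED_HREF.findall(markup)


def links_by_parser(markup):
    """Return hrefs found by the HTML parser."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(links_by_regex(SAMPLE))   # misses the unquoted link
    print(links_by_parser(SAMPLE))  # finds both links
```

You can of course extend the pattern to cover unquoted values, then single quotes, then stray whitespace, and so on, which is how you end up with the "many patterns" from the question. A parser already handles those variations.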

Along with the HTML Agility Pack, you can take a look at Majestic-12's C# HTML parser (.NET).

Toni Petrina