0

I am trying to build a web crawler. So far I send a request and get the full HTML of the page back in the response.

Next I thought of using regular expressions to extract the links from the HTML page. However, the more I try to learn them, the trickier they seem.

Are there any alternatives to regular expressions? (This may look like a discussion question, but it is not: I have searched the internet and haven't found a satisfactory answer.)

akuzma
Win Coder

2 Answers

2

HtmlAgilityPack is the best-known library for parsing HTML in .NET.

xanatos
1

Regular expressions can't reliably parse HTML (see http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html); use a proper HTML parser such as HtmlAgilityPack:

http://www.nuget.org/packages/HtmlAgilityPack
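
As a rough sketch of what link extraction with HtmlAgilityPack might look like, assuming the package has been installed from NuGet: `LinkExtractor` and `ExtractLinks` are made-up names for illustration, and the sample HTML in `Main` is invented. It selects every `<a>` element that has an `href` attribute via XPath and collects the attribute values.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack itself.
static class LinkExtractor
{
    // Returns the href value of every <a> element that carries one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    static void Main()
    {
        // Invented sample page: two links and one anchor without an href.
        const string html =
            "<html><body>" +
            "<a href=\"/about\">About</a>" +
            "<a href='http://example.com/'>Example</a>" +
            "<a>no href here</a>" +
            "</body></html>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

The extracted values may be relative (such as `/about`), so a crawler would probably need to resolve them against the page address before requesting them, for example with `new Uri(baseUri, href)`.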

Antonio Bakula