4

This is just a general question. Currently I am scraping webpages using regex, but it is sometimes too difficult to figure out the right regular expression. Is XSL/XPath a viable alternative to regex in C#?

Also, I would like to know if there are more advanced techniques for webpage scraping other than the two listed above. Thanks.

Kevin
  • XSL/XPath requires that the page be XHTML 1.0; not all HTML conforms to something that is easily consumed by an XML parser – rene Feb 16 '11 at 18:23
  • @rene: is it that clear-cut? If the webpage is XHTML 1.0, XPath can be used in C#; if it is not XHTML 1.0, I should just seek other alternatives? – Kevin Feb 16 '11 at 18:38
  • There is a difference between claiming XHTML 1.0 (in the doctype) and actually being XHTML 1.0 compliant. No, sorry, it is not clear-cut. But it looks like you already have a great answer – rene Feb 16 '11 at 18:40
  • surprisingly enough, the best answers to a question like this have been posted in an older and more specific question: https://stackoverflow.com/questions/18065526/pulling-data-from-a-webpage-parsing-it-for-specific-pieces-and-displaying-it/33756899 – knocte Jul 05 '19 at 06:15
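
To illustrate the point made in the comments above, here is a minimal sketch using only `System.Xml` from the .NET base class library. The class name `XPathOnXhtml`, the helper `ExtractLinks`, and the sample markup are made up for illustration. It assumes markup with no XML namespace and no DOCTYPE; a real XHTML page normally declares the XHTML namespace, in which case the XPath query would also need an `XmlNamespaceManager`.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XPathOnXhtml
{
    // Hypothetical helper: returns "href -> text" for every <a href> in the markup.
    // Throws XmlException when the markup is not well-formed XML.
    public static List<string> ExtractLinks(string markup)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        var results = new List<string>();
        foreach (XmlNode link in doc.SelectNodes("//a[@href]"))
        {
            results.Add(link.Attributes["href"].Value + " -> " + link.InnerText);
        }
        return results;
    }

    static void Main()
    {
        // Well-formed markup: every tag is closed, so the XML parser accepts it.
        string wellFormed =
            "<html><body><a href=\"/a\">First</a><a href=\"/b\">Second</a></body></html>";
        foreach (string line in ExtractLinks(wellFormed))
        {
            Console.WriteLine(line);
        }

        // Typical real-world HTML: <p> and <br> are never closed.
        string tagSoup = "<html><body><p>Line one<br>Line two</body></html>";
        try
        {
            ExtractLinks(tagSoup);
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed XML: " + ex.Message);
        }
    }
}
```

The second input is the kind of markup browsers accept without complaint, yet the XML parser rejects it, which is why XPath through the plain XML APIs only works on pages that really are well-formed.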

2 Answers

7

You may take a look at SgmlReader or Html Agility Pack, which are HTML parsing libraries for .NET.
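
As a rough sketch of what this can look like with Html Agility Pack (assuming the `HtmlAgilityPack` NuGet package is referenced), the following runs an XPath query against HTML that is not well-formed XML. The class name `HapExample`, the helper `ExtractLinks`, and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class HapExample
{
    // Hypothetical helper: returns "href -> text" for every <a href> in the HTML.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: unclosed tags do not throw

        var results = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches.
            return results;
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            results.Add(href + " -> " + link.InnerText.Trim());
        }
        return results;
    }

    static void Main()
    {
        // <p> and <br> are left unclosed, as on many real pages.
        string html =
            "<html><body><p>Intro<br><a href='/a'>First</a> <a href='/b'>Second</a></body></html>";
        foreach (string line in ExtractLinks(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

This keeps the XPath syntax from the question while dropping the requirement that the page be valid XHTML. The null check matters because a query with no matches would otherwise fail in the `foreach`.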

carla
Darin Dimitrov
0

An easy way to gather data from a web page is WebsiteParser. It's based on Html Agility Pack, and you can simply describe your properties using attributes and CSS selectors.

GitHub here

jasniec