Web page (HTML) scraping using C#

Question

This is just a general question. Currently I am scraping web pages using regex, but the regular expressions are sometimes too difficult to work out. Is XSL/XPath a viable alternative to regex in C#?

I would also like to know whether there are more advanced techniques for web page scraping than the two listed above. Thanks.

XSL/XPath requires the page to be XHTML 1.0. Not all HTML conforms to something that is easily consumed by an XML parser. – rene, Feb 16 '11 at 18:23
@rene: Is it that clear-cut? If the page is XHTML 1.0, XPath can be used in C#; if it is not XHTML 1.0, should I just look for other alternatives? – Kevin, Feb 16 '11 at 18:38
There is a difference between the claim (in the doctype) of being XHTML 1.0 and actually being XHTML 1.0 compliant. No, sorry, it is not clear-cut. But it looks like you already have a great answer. – rene, Feb 16 '11 at 18:40
Surprisingly enough, the best answers to a question like this have been posted in an older and more specific question: https://stackoverflow.com/questions/18065526/pulling-data-from-a-webpage-parsing-it-for-specific-pieces-and-displaying-it/33756899 – knocte, Jul 05 '19 at 06:15
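
To make rene's point concrete, here is a minimal sketch of XPath over markup that really is well-formed XHTML, using only `System.Xml` from the base class library. The class and method names (`XhtmlXPathSketch`, `ExtractLinks`, `IsWellFormed`) are made up for illustration, and the sketch assumes the input carries the XHTML namespace and no DOCTYPE declaration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlXPathSketch
{
    // Hypothetical helper: returns the href of every <a> element
    // in a well-formed XHTML string.
    public static List<string> ExtractLinks(string xhtml)
    {
        var doc = new XmlDocument();
        // LoadXml throws XmlException when the markup is not well-formed XML.
        doc.LoadXml(xhtml);

        // XHTML elements live in a namespace, so the XPath expression
        // needs a prefix bound to that namespace.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        var links = new List<string>();
        foreach (XmlNode a in doc.SelectNodes("//x:a[@href]", ns))
        {
            links.Add(a.Attributes["href"].Value);
        }
        return links;
    }

    // Hypothetical helper: reports whether a strict XML parser accepts the markup.
    public static bool IsWellFormed(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Typical real-world HTML, such as a paragraph containing an unclosed `<br>`, makes `LoadXml` throw, which is why a tolerant HTML parser is usually needed before XPath can be applied.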

score 7 · Accepted Answer · edited Nov 24 '17 at 20:50

You may take a look at SgmlReader or Html Agility Pack, which are HTML parsing libraries for .NET.
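
As a rough illustration of the Html Agility Pack route, the sketch below loads an HTML string and runs an XPath query against it. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class and method names (`AgilityPackSketch`, `ExtractLinks`) are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AgilityPackSketch
{
    // Hypothetical helper: returns the href of every <a> element in an HTML string.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        // The parser is tolerant, so the input does not have to be well-formed XML.
        doc.LoadHtml(html);

        var links = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null)
        {
            return links;
        }

        foreach (var a in nodes)
        {
            links.Add(a.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check matters in practice: iterating the result of `SelectNodes` directly fails with a `NullReferenceException` on pages where the expression matches nothing.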

edited Nov 24 '17 at 20:50 by carla

answered Feb 16 '11 at 18:23 by Darin Dimitrov

From the NuGet package manager, run: `Install-Package HtmlAgilityPack` and you're set :) – R. Martinho Fernandes Feb 16 '11 at 18:24
@Martinho, personally I prefer `SgmlReader`, but `Html Agility Pack` also works fine :-) – Darin Dimitrov Feb 16 '11 at 18:25
Does this imply using it to transform the HTML into well-formed XML, and then running XPath against that XML document in C#? – Kevin Feb 16 '11 at 18:44
@Robert, yes you get an XmlDocument from your HTML page which gives you full access to the DOM tree. – Darin Dimitrov Feb 16 '11 at 18:53
For completeness: you can also do queries with jQuery-like CSS selectors using [Fizzler](http://code.google.com/p/fizzler). Currently it works with the HtmlAgilityPack, but I believe there are plans to support other parsing libraries. – R. Martinho Fernandes Feb 16 '11 at 18:59
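
The conversion discussed in these comments (HTML in, `XmlDocument` out, then XPath) could look roughly like the sketch below. It assumes an SgmlReader package exposing the `Sgml` namespace is referenced; the class and method names (`SgmlReaderSketch`, `ExtractLinks`) are hypothetical, and the XPath expression assumes the resulting elements are lower-cased and not placed in a namespace.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using Sgml; // assumption: an SgmlReader NuGet package is installed

public static class SgmlReaderSketch
{
    // Hypothetical helper: converts HTML to an XmlDocument via SgmlReader,
    // then queries it with ordinary System.Xml XPath.
    public static List<string> ExtractLinks(string html)
    {
        using (var sgmlReader = new SgmlReader())
        {
            sgmlReader.DocType = "HTML";                 // use the built-in HTML DTD
            sgmlReader.CaseFolding = CaseFolding.ToLower; // normalise tag names to lower case
            sgmlReader.InputStream = new StringReader(html);

            var doc = new XmlDocument();
            doc.Load(sgmlReader); // SgmlReader is an XmlReader, so XmlDocument can consume it

            var links = new List<string>();
            foreach (XmlNode a in doc.SelectNodes("//a[@href]"))
            {
                links.Add(a.Attributes["href"].Value);
            }
            return links;
        }
    }
}
```

The appeal of this design is that everything after `doc.Load` is standard `System.Xml`, so existing XPath or XSLT code can be reused unchanged.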

score 0 · Answer 2 · answered Nov 11 '19 at 10:22

An easy way to gather data from a web page is WebsiteParser. It is based on Html Agility Pack, and you describe the properties you want using attributes and CSS selectors.

Github here

answered Nov 11 '19 at 10:22 by jasniec
