Parsing HTML/CSS/PHP File(s)

Question

I need some kind of library (possibly HTMLAgilityPack?) that can parse an HTML file and a CSS file. It's kind of tricky, because an HTML (PHP) file might contain both PHP code and CSS code, so there's no way in hell I'm going to attempt to battle this on my own without the help of some library.

I'm using C#/WinForms with Visual Studio Express 2010. So far the only library I can find is the HTML Agility Pack, which has no documentation that I can find, and I'm not sure it does everything I need.

My exact requirement is to parse an HTML file, loop through every single tag, get its contents, get all of its attributes and their values, and do something with each one of them.

Have you seen a library like this around before? Can someone please provide some help/advice on how to go about this? I'm not really looking for perfection, just simplicity and variety.

The PHP and CSS shouldn't be an issue if all you care about is the HTML elements. That code would probably just be text within a `
`. Unless you need to parse the PHP too, HAP will do the trick. — mpen, Mar 11 '11 at 05:51

score 1 · Accepted Answer · edited May 23 '17 at 12:04

The HTML Agility Pack will allow you to loop through the elements as you describe. The documentation is a little thin, but the API is modelled after the XmlDocument class, which eases the learning curve a lot. Elements are selected using XPath queries. There is a small example of the usage here.

Here's some sample code that goes through all the elements in an HTML document (note this includes text elements, <style> elements, etc.):

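using System.Linq;      // Enumerable.Empty
using HtmlAgilityPack;  // HtmlDocument, HtmlNode

var doc = new HtmlDocument();
doc.LoadHtml(someHtmlString);

// "//*" matches every element in the document; a bare "*" would only match
// the direct children of DocumentNode. SelectNodes returns null when nothing
// matches, hence the fallback to an empty sequence.
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*") ?? Enumerable.Empty<HtmlNode>()) {
    var contents = node.InnerHtml;  // markup between the element's tags
    foreach (var attribute in node.Attributes) {
        var name = attribute.Name;
        var value = attribute.Value;
    }
}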

This question explains how to deal with the PHP tags (you may want to either ignore or extract them).

Yeah, don't forget that `?? Enumerable.Empty<HtmlNode>()` bit... (much nicer than my solution, btw! +1). That's my biggest grievance about HAP. — mpen, Mar 11 '11 at 05:50

score 0 · Answer 2 · edited May 23 '17 at 11:55

Getting rid of the PHP code shouldn't be hard, and could probably be done with regular expressions (basically, you just want to strip out anything between <?php and ?>).
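
As a rough sketch of that regex approach (the `PhpStripper` class and `StripPhp` method are names made up for illustration, and the pattern is an assumption, not a full PHP tokenizer), something like this should handle ordinary `<?php ... ?>` blocks, including a final block whose closing `?>` is omitted:

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhpStripper
{
    // Lazily matches from "<?php" up to the first "?>", or up to the end of
    // the input when the closing tag is missing. Singleline lets '.' match
    // newlines so multi-line blocks are covered.
    private static readonly Regex PhpBlock = new Regex(
        @"<\?php.*?(\?>|\z)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the input with every PHP block removed.
    public static string StripPhp(string source)
    {
        return PhpBlock.Replace(source, string.Empty);
    }
}
```

Calling `PhpStripper.StripPhp(fileContents)` before handing the text to the HTML parser leaves only the markup. Two caveats: short tags such as `<?=` are not matched by this pattern, and a literal `?>` inside a PHP string will end the match early, so treat it as good enough for typical templates rather than bulletproof.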

The CSS is just text data as far as HTML is concerned, so you can parse the HTML, pull out the contents of each <style> tag as a string, and then parse that with a CSS parser (if you even care about the contents of the CSS).
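
Pulling out the `<style>` contents could look roughly like this with the HTML Agility Pack. This is only a sketch: `StyleExtractor` and `GetStyleBlocks` are made-up names, and it assumes the HtmlAgilityPack assembly is referenced in the project.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class StyleExtractor
{
    // Hypothetical helper: returns the raw CSS text of each <style> element.
    public static List<string> GetStyleBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing
        // matches, so fall back to an empty sequence.
        var styleNodes = doc.DocumentNode.SelectNodes("//style")
                         ?? Enumerable.Empty<HtmlNode>();

        // InnerHtml is the text between <style> and </style>.
        return styleNodes.Select(node => node.InnerHtml).ToList();
    }
}
```

Each returned string can then be passed to whatever CSS parser you settle on, or ignored if you only care about the HTML.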

I hadn't heard of the HTML Agility Pack before, but it looks like it would do the job, and there are a couple of SO answers that recommend it.

I also found this SO question about CSS parsers, in case you need that too.
