
Following on from a post I made earlier, I am making progress with what I require, but, not knowing much about how regular expressions work, I'm stuck!

This line:

FilesM = Regex.Matches(StrFile, "<link.*?href=""(.*?)"".*? />")

extracts all of the <link ...> elements from the HTML of my page so that I can compile a combined style file.

However, I need to exclude any media="print" links.

I am also trying to combine JS scripts

FilesM1 = Regex.Matches(StrFile, "<script.*?src=""(.*?)"".*?></script>")

Does this, but in this case I want to exclude any scripts which are not hosted locally. I'd like to do this by excluding any scripts where the src starts with "http".

So how would I exclude these two cases from the match collection?
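Edit: in case it helps anyone finding this later, it looks like a negative lookahead can express both exclusions directly in the patterns. This is only a sketch I haven't fully tested against my real pages:

' Skip any <link> tag that contains media="print" anywhere before its closing bracket
FilesM = Regex.Matches(StrFile, "<link(?![^>]*media=""print"")[^>]*?href=""(.*?)""[^>]*?/>")

' Skip any <script> tag whose src value starts with "http"
FilesM1 = Regex.Matches(StrFile, "<script[^>]*?src=""(?!http)(.*?)""[^>]*?></script>")

The (?!...) group fails the match at that position if its contents can be matched there, without consuming any characters; [^>] keeps the check from running past the end of the tag.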

Steven Doggart
Jamie Hartnoll
  • I don't suppose you'd consider loading the HTML with an XmlDocument object so you could then easily find the elements you are looking for with XPath? – Steven Doggart Jun 15 '12 at 12:30
  • Um... I don't know how! However, I think that sounds like it could cause quite a lot of additional issues. – Jamie Hartnoll Jun 15 '12 at 12:34

1 Answer


I know this isn't exactly what you are looking for, but, in case you are interested, here's an example of how to find just the elements you care about using XPath:

Dim doc As New XmlDocument()
doc.LoadXml(html)
Dim linkNodes As XmlNodeList = doc.SelectNodes("descendant-or-self::link[(@href) and (not(@media) or (@media != 'print'))]")
Dim scriptNodes As XmlNodeList = doc.SelectNodes("descendant-or-self::script[(@src) and (not(starts-with(@src,'http')))]")

The XmlDocument.SelectNodes method returns all elements that match the given XPath.

In the XPath string, descendant-or-self:: means you want it to search all elements from the current position (the root) down through all descendants for the following element name. If that was left out, it would only look for matching elements at the current (root) level.

The [] clauses provide conditions. For instance, link[@media != 'print'] matches all link elements that have a media attribute whose value is not "print". Note that it would not match a link with no media attribute at all, which is why the full expression above also includes not(@media). The @ sign specifies an attribute name.

Simply listing an attribute name by itself in a condition means that you are checking for the existence of that attribute. For instance, link[@href] matches all link elements that do have an href attribute.
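Putting it together, here is a rough sketch of how you could collect the file paths from the matched nodes (the variable and list names are just for illustration; html is assumed to hold the page markup):

Imports System.Xml
Imports System.Collections.Generic

Dim doc As New XmlDocument()
doc.LoadXml(html)

' Gather the href of every non-print stylesheet link.
Dim styleFiles As New List(Of String)()
For Each node As XmlNode In doc.SelectNodes("descendant-or-self::link[(@href) and (not(@media) or (@media != 'print'))]")
    styleFiles.Add(node.Attributes("href").Value)
Next

' Gather the src of every locally-hosted script.
Dim scriptFiles As New List(Of String)()
For Each node As XmlNode In doc.SelectNodes("descendant-or-self::script[(@src) and (not(starts-with(@src,'http')))]")
    scriptFiles.Add(node.Attributes("src").Value)
Next

Since the XPath conditions already filter out print stylesheets and remote scripts, the loop bodies don't need any further checks.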

Steven Doggart
  • Ah-ha, now you explain it like that, it looks good. Is there a way I can grab the HTML of the document, just between the `head` tags? What I'm currently using is grabbing the whole page and it can be quite slow. – Jamie Hartnoll Jun 15 '12 at 13:12
  • Do you mean when reading the document from disk or from a web server, you only want to read a portion of the file? Or are you saying you want to limit which portion of the document the SelectNodes method is searching? – Steven Doggart Jun 15 '12 at 13:19
  • 1
    Note that this will only work with valid xhtml. If you're working with ordinary html, then you should use the html agility pack. – Steve Wortham Jun 15 '12 at 13:23
  • @SteveWortham Good point. The HTML Agility Pack also supports the same SelectNodes/XPath feature and it is capable of reading HTML files that don't strictly adhere to XML well-formedness rules. – Steven Doggart Jun 15 '12 at 13:26
  • Argh, went for some lunch! I'm using HTML, not XHTML; how would I alter the code above? @SteveDog, I mean when reading the document from a webserver. – Jamie Hartnoll Jun 15 '12 at 14:03
  • @JamieHartnoll I do not know how to read only a portion of an html file from a web server. I suspect that may not even be possible. I'm not an expert in that area. As far as what would need to be altered, the HTML Agility Pack is an open source third-party library that you would need to download. It's not part of the .NET framework. If you choose to use that, I believe the only change that would need to be made to the above code would be the first line changing to `Dim doc As New HtmlDocument()`. I think everything else stays the same. – Steven Doggart Jun 15 '12 at 14:19
  • Thanks, I'll look into it. I'm beginning to wonder if this approach is a good idea; I think a bit more thought is required. I'll mark this as accepted, though, as it looks like it will do what I need if I choose to carry on down this route. – Jamie Hartnoll Jun 15 '12 at 16:28
  • Yeah, parsing HTML with Regex is generally considered to be a bad idea in most cases. If you are just doing something really simple, it is ok, and certainly faster, but if you are trying to do anything at all complex, it's much safer to use an HTML parser such as the HtmlAgilityPack. For a good laugh, check out the answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Steven Doggart Jun 15 '12 at 16:47
  • I don't think a regular expression would be faster to write or faster to execute. It's both simpler and more efficient to parse HTML with the HTML Agility Pack. In terms of pure speed, the HTML Agility Pack is optimized for parsing HTML, whereas any contrived regular expression would probably involve a great deal of backtracking. – Steve Wortham Jun 18 '12 at 20:16
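For reference, the Html Agility Pack version of the answer's snippet would look roughly like this (a sketch; note that, unlike with XmlDocument, SelectNodes is called on the DocumentNode property, so slightly more than the first line changes):

' Requires a reference to HtmlAgilityPack.dll (a separate download, not part of the .NET Framework).
Imports HtmlAgilityPack

Dim doc As New HtmlDocument()
doc.LoadHtml(html) ' LoadHtml instead of LoadXml; tolerant of non-well-formed HTML

Dim linkNodes = doc.DocumentNode.SelectNodes("descendant-or-self::link[(@href) and (not(@media) or (@media != 'print'))]")
Dim scriptNodes = doc.DocumentNode.SelectNodes("descendant-or-self::script[(@src) and (not(starts-with(@src,'http')))]")

One caveat: the Agility Pack's SelectNodes returns Nothing rather than an empty collection when nothing matches, so check for Nothing before looping.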