I need to grab the href value from HTML like the following in C#:

<td class="tl"><a href="http://facebook.com/"target="_blank"><img src="images/poput_icon.png"/></a>

Can anyone show me how to do this? Is a regex the best approach? I need to gather these from a page that contains hundreds of links, but they all look like the above code. I want to ignore other hrefs on the page.

Thanks in advance.

Jimmy

Jimmy Collins
  • Don't use Regex to parse XML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jeff Yates Nov 23 '10 at 18:02
  • I generally agree, but remember that C# is a stickler for valid XML and will start throwing exceptions if it's given malformed input. If you know for sure the XML you get will be valid, use the XML parser. But if you're getting an arbitrary document that looks like XML but isn't, you will need some other tool, such as regular expressions, to simply pull out this specific item. No, regular expressions can't really "parse" XML, but you're trying to extract a single field from a single tag matching a clear pattern, which regular expressions can do just fine. – Adam Norberg Nov 23 '10 at 18:18

2 Answers

First, don't use Regular Expressions to parse XML. See here for more detailed information on the whys and wherefores.

Second, you can use LINQ-to-XML to achieve this. Assuming you have loaded your XML snippet into an XDocument instance (and therefore, td is the root element), you can then do the following:

var href = doc
    .Element("td")
    .Element("a")
    .Attribute("href")
    .Value;
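To address the filtering question raised in the comments below, here is a hedged sketch of how LINQ-to-XML can pick out only the anchors matching the pattern from a larger page. It assumes the page is well-formed XHTML (note the original snippet is missing a space before target=, which XDocument.Parse would reject); the sample markup and the class="tl" filter are assumptions based on the question's snippet:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class LinkExtractor
{
    static void Main()
    {
        // Hypothetical well-formed stand-in for the real page.
        var xml = @"<table>
            <tr><td class=""tl""><a href=""http://facebook.com/"" target=""_blank""><img src=""images/poput_icon.png""/></a></td></tr>
            <tr><td><a href=""http://example.com/other"">ignore me</a></td></tr>
        </table>";

        var doc = XDocument.Parse(xml);

        // Keep only anchors inside a td with class ""tl"" that wrap an img,
        // ignoring every other href on the page.
        var hrefs = doc.Descendants("td")
            .Where(td => (string)td.Attribute("class") == "tl")
            .Elements("a")
            .Where(a => a.Elements("img").Any())
            .Select(a => (string)a.Attribute("href"))
            .ToList();

        foreach (var href in hrefs)
            Console.WriteLine(href);
    }
}
```

The Where clauses are what scale this from "the root td" to "hundreds of links on one page": Descendants walks the whole tree, and only cells matching the pattern survive.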
Jeff Yates
  • The only reason why I would recommend a regex anyway is that .NET is generally pretty persnickety about valid XML. This looks like an automated link-scraping tool, which means it's probably working from an untrusted source- specifically, one we cannot guarantee will spit out valid XML. I think the question was mostly about identifying links in the right format. XML parsing is probably the right strategy anyway, but this specific answer doesn't address the actual target identification problem. – Adam Norberg Nov 23 '10 at 18:08
  • That's a lot of assumptions from such a simple question. Based on the information to hand and the fact that RegEx is the completely wrong approach to parsing HTML or XML, this is the most appropriate answer. If the OP adds more detail to the contrary, I will certainly look into revising my response. – Jeff Yates Nov 23 '10 at 18:10
  • "I need to gather these from a page that contains 100s of links, but they all look like the above code. I want to ignore other href's on the page." -- So while you may choose to assume that the XML is valid, you should probably explain how to use the canned XML parser to find only links that "look like" the above format. (Which isn't specified well, which is why I have an overtolerant regular expression.) Honestly, I'd like to know, personally; I don't have enough experience with the XML-parsing libraries, and would like to see the more elaborate example. – Adam Norberg Nov 23 '10 at 18:15
  • @Adam: You're correct that the existing BCL is insufficient for parsing when faced with XML that is not well-formed or indeed HTML. However, there are some libraries available that can assist in such situations (a quick Google search for "HTML Parser" for example, provides a few though I don't know the efficacy of any in particular). – Jeff Yates Nov 23 '10 at 20:20
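Jeff doesn't name a specific library, but one commonly cited example of the kind of tolerant "HTML Parser" he alludes to is HtmlAgilityPack (my choice, not his), which accepts malformed markup such as the original snippet's missing space before target=. A minimal sketch, assuming the HtmlAgilityPack NuGet package is installed:

```csharp
// Sketch using HtmlAgilityPack, one example of the tolerant HTML parsers
// mentioned above (the library choice is an assumption, not from the thread).
using System;
using HtmlAgilityPack;

class TolerantScraper
{
    static void Main()
    {
        // Note the missing space before target= -- malformed, but tolerated.
        var html = @"<td class=""tl""><a href=""http://facebook.com/""target=""_blank""><img src=""images/poput_icon.png""/></a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: anchors that wrap an img, inside a td with class ""tl"".
        var anchors = doc.DocumentNode.SelectNodes("//td[@class='tl']/a[img]");
        if (anchors != null)
            foreach (var a in anchors)
                Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}
```

Because HtmlAgilityPack repairs rather than rejects bad markup, this sidesteps the validity objection that drives the regex-versus-parser debate in these comments.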

I would do this with a regular expression, yes. So you want to find the href value inside an anchor tag surrounding an img tag at the beginning of a table cell?

Here's C# code to create a Regex object that will match links like that, then use it, where document is a String containing the entire document to search:

Regex linkscraper = new Regex(@"<\s*td[^>]*>\s*<\s*a[^>]*href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<\s*img[^>]*>\s*<\s*/a\s*>");
MatchCollection links = linkscraper.Matches(document);

Matching links are in Match objects in the links collection, under the group name "link".

The leading @ makes this a verbatim string literal: every \ is taken literally rather than treated as an escape sequence, so we aren't forced to double them to get regular-expression \ behavior. Since quotes can't be escaped with \" in a verbatim string, they're escaped by doubling them ("").
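A quick illustration of the verbatim-string rules just described (the sample strings are mine):

```csharp
using System;

class VerbatimDemo
{
    static void Main()
    {
        // Same text, written both ways.
        var regular  = "a\\b \"quoted\"";   // \ and " escaped with backslashes
        var verbatim = @"a\b ""quoted""";   // \ is literal; " is doubled instead
        Console.WriteLine(regular == verbatim);  // True
    }
}
```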

This is a fairly complicated regular expression. Breaking it down:

  • It's sprinkled with \s* elements, meaning "any amount of whitespace, or none". These make your linkscraper expression tolerate the variations in spacing that HTML allows.
  • The [^>] character class matches any character that isn't ">"; repeating it (the trailing *) represents "other stuff inside the tag that we don't care about". The exclusion keeps the match from running past the end of a tag: quantifiers are greedy by default, so an unrestricted .* would cheerfully match from the first tag in the document all the way to the end of the last one.
  • With all those pieces explained, it's relatively simple to understand:
    • a TD tag (which may or may not have spaces, or attributes), immediately followed by (for definitions of "immediately" that allow arbitrary whitespace)
    • an A tag, where the href is captured into a capturing group named "link". The [^""], which is the verbatim-string spelling of [^"], matches all non-quote characters. We don't care about the rest of the tag.
    • An img tag, which can contain whatever it wants.
    • The /a closing tag.

If you know more about the exact formatting of the document you are trying to extract links from, you can tighten up this regular expression. Specifically, the [^>]* groups, the "match zero or more characters that aren't >" blocks used to allow tags to contain whatever they want, should probably be replaced by subexpressions more specific to the actual document. This will catch anything of the form <TD><A href=...><IMG></a>, which may or may not match more than you want it to.
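Putting the pieces of this answer together, a runnable sketch (the two-cell sample document is my own, and I've written [^>]*> rather than [^>]> after the href group so the expression actually matches the target= attribute that follows it):

```csharp
using System;
using System.Text.RegularExpressions;

class LinkScraperDemo
{
    static void Main()
    {
        // Hypothetical document: two cells in the format from the question.
        string document =
            @"<td class=""tl""><a href=""http://facebook.com/"" target=""_blank""><img src=""images/poput_icon.png""/></a>" +
            @"<td class=""tl""><a href=""http://example.org/"" target=""_blank""><img src=""images/poput_icon.png""/></a>";

        Regex linkscraper = new Regex(
            @"<\s*td[^>]*>\s*<\s*a[^>]*href\s*=\s*""(?<link>[^""]*)""[^>]*>\s*<\s*img[^>]*>\s*<\s*/a\s*>");

        // Each Match exposes the captured URL through Groups["link"].
        foreach (Match m in linkscraper.Matches(document))
            Console.WriteLine(m.Groups["link"].Value);
    }
}
```

This prints each captured href on its own line; swapping in the real page source for the sample string is all the change a caller needs.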

Adam Norberg
  • Don't use regular expressions to parse HTML or XML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jeff Yates Nov 23 '10 at 18:11
  • Don't use the canned XML parser to parse untrusted HTML documents that are likely to be slightly (or potentially severely) invalid. I'm assuming this is an automated link scraper (potentially evil, I acknowledge, but I might as well answer the question), because otherwise there's almost certainly some other, more practical way to extract the data, like reading the original data-source that generated this table of hundreds of links. Malforming the HTML in browser-friendly ways is the usual first line of defense against exactly these shenanigans. – Adam Norberg Nov 23 '10 at 18:13
  • @Adam Norberg - nothing evil :-) - The actual page I need to get the links from is www.google.com/adplanner/static/top1000/ (top 1000 sites in the world) to validate them against a tool which blocks nasty websites, (i.e. make sure these are NOT blocked). – Jimmy Collins Nov 23 '10 at 19:35
  • BTW, I get zero matches using the above RegEx. – Jimmy Collins Nov 23 '10 at 19:37
  • Verifying my own regular expression: I was wrong about needing escapes for angle brackets. Checking the expression with RegexBuddy, which I should have used in the first place, that should be my only mistake. That said, Google is very likely to produce valid XML, so here Jeff is correct: you're much better off with the XML libraries. – Adam Norberg Nov 23 '10 at 19:58