Parsing an HREF from an HTML string using a regular expression

Question

I need to parse a link to a zip file out of HTML. The name of this zip file changes every month. Here is a snippet of the HTML I need to parse:

<a href="http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip">

The string I need to get is "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip" so I can download the file using WebClient. The only portion of that zip file URL that remains constant from month to month is "http://nppes.viva-it.com/". Is there a way using a regular expression to parse the full URL, "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip", out of the HTML?

In the general case, using a regular expression to parse HTML *won't* work. However narrow you build the pattern, a perfectly legal HTML file can defeat it. Use a real parser — Michael Lorton, Apr 13 '12 at 01:03
See: http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c — David Peden, Apr 13 '12 at 01:50

score 1 · Answer 1 · answered Apr 13 '12 at 09:33

By using HtmlAgilityPack:

var html = "<a href=\"http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip\">";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchor = doc.DocumentNode.SelectSingleNode("//a");
var href = anchor.GetAttributeValue("href", null);

Now the href variable holds "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip".

Isn't that simpler than a regex?
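
On a real page the first `//a` is unlikely to be the zip link, so here is a rough sketch of narrowing the selection to the anchor the question is after. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`ZipAnchorFinder`, `TryFindZipHref`) are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class ZipAnchorFinder
{
    // Hypothetical helper: finds the first anchor under the known host
    // whose href ends in ".zip".
    public static bool TryFindZipHref(string html, out string href)
    {
        href = string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath 1.0 offers starts-with() but not ends-with(),
        // so the ".zip" suffix is checked in C# below.
        var anchors = doc.DocumentNode.SelectNodes(
            "//a[starts-with(@href, 'http://nppes.viva-it.com/')]");
        if (anchors == null)
            return false; // SelectNodes returns null when nothing matches

        foreach (var anchor in anchors)
        {
            string candidate = anchor.GetAttributeValue("href", "");
            if (candidate.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            {
                href = candidate;
                return true;
            }
        }
        return false;
    }
}
```

If `TryFindZipHref(html, out var href)` returns true, `href` can be handed to `WebClient.DownloadFile(href, localPath)`, where `localPath` is whatever file name you want to save to.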

score 0 · Accepted Answer · answered Apr 12 '12 at 23:54

If there will only ever be one ZIP linked to on the page, no problem:

Regex re = new Regex(@"http://nppes\.viva-it\.com/.+\.zip");

string url = re.Match(html).Value; // the matched URL, or "" if nothing matched

Here's a demo.
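
A slightly more defensive variant, as a sketch only: `.+` is greedy, so if a second `.zip` URL ever appeared on the same line, the match would run from the first URL through to the last `.zip`. Restricting the path to characters that cannot appear inside an attribute value avoids that. The names `ZipLinkFinder` and `TryFindZipUrl` are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class ZipLinkFinder
{
    // The host prefix is the constant part from the question. The rest of
    // the URL may contain anything except whitespace, quotes, or angle
    // brackets, so the match cannot spill past the end of the attribute.
    private static readonly Regex ZipUrl = new Regex(
        @"http://nppes\.viva-it\.com/[^\s""'<>]+\.zip",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and sets url to the first match;
    // otherwise returns false and sets url to the empty string.
    public static bool TryFindZipUrl(string html, out string url)
    {
        Match m = ZipUrl.Match(html);
        url = m.Value; // Match.Value is "" for a failed match
        return m.Success;
    }
}
```

When `TryFindZipUrl(html, out var url)` succeeds, `url` can be passed to `WebClient.DownloadFile(url, localPath)`, with `localPath` being a file name of your choosing.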

Ry-

score 0 · Answer 3 · answered Apr 13 '12 at 00:54

Here is a raw regex, written in free-spacing mode, that uses branch reset.
The URL ends up in capture group 2.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?|
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s*     \g{-2} )
      | (?> (?!\s*['"]) \s* () (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>

I'm not sure whether C# supports branch reset. If it doesn't, this variation avoids it.
The URL is then capture group 2 concatenated with capture group 3 (only one of the two is ever set).

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?:
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
      | (?> (?!\s*['"]) \s* (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>
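
For C# specifically: as far as I know, the .NET regex engine supports neither branch reset `(?|...)` nor the `\g{-2}` relative backreference, so neither pattern above can be pasted in unchanged. A rough .NET adaptation, under those assumptions, is sketched below. It is simplified (it drops the trailing-attribute check) and relies on .NET allowing the same group name in both alternatives. `HrefZipRegex` and `TryGetZipHref` are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class HrefZipRegex
{
    // Free-spacing pattern; "" inside the verbatim string is one literal quote.
    private static readonly Regex AnchorZip = new Regex(@"
        <a\b
        (?: [^>""'] | ""[^""]*"" | '[^']*' )*?      # skip any attributes before href
        \s href \s* = \s*
        (?:
            # quoted value: remember the quote in group q, stop at the same quote
            (?<q>[""']) \s*
            (?<url> http://nppes\.viva-it\.com/ (?: (?!\k<q>) [^<>] )+? \.zip )
            \s* \k<q>
          |
            # unquoted value: runs until whitespace or the closing bracket
            (?<url> http://nppes\.viva-it\.com/ [^\s""'>]+ \.zip ) (?=[\s>])
        )",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Hypothetical helper: true plus the URL on success, false plus "" otherwise.
    public static bool TryGetZipHref(string html, out string url)
    {
        Match m = AnchorZip.Match(html);
        url = m.Success ? m.Groups["url"].Value : string.Empty;
        return m.Success;
    }
}
```

Because both alternatives write to the same named group, the caller reads `url` directly instead of concatenating two capture groups.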
