0

I need to parse a link to a ZIP file out of HTML. The name of this ZIP file changes every month. Here is a snippet of the HTML I need to parse:

<a href="http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip">

The string I need to get is "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip" so I can download the file using WebClient. The only portion of that URL that remains constant from month to month is "http://nppes.viva-it.com/". Is there a way to extract the full URL, "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip", from the HTML using a regular expression?

Ry-
  • In the general case, using a regular expression to parse HTML *won't* work. However narrowly you build the pattern, a perfectly legal HTML file can defeat it. Use a real parser. – Michael Lorton Apr 13 '12 at 01:03
  • See: http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c – David Peden Apr 13 '12 at 01:50

3 Answers

1

By using HtmlAgilityPack:

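// Requires the HtmlAgilityPack NuGet package and: using HtmlAgilityPack;
var html = "<a href=\"http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip\">";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var anchor = doc.DocumentNode.SelectSingleNode("//a"); // first <a> element in the document
var href = anchor.GetAttributeValue("href", null);     // null if the attribute is missing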

Now the href variable holds the value "http://nppes.viva-it.com/NPPES_Data_Dissemination_Mar_2012.zip".

Isn't it simpler than regex?
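
On a real page with many links, "//a" alone would return whichever anchor comes first, so the query probably needs to be narrowed to the constant URL prefix. A minimal sketch of that idea, assuming the HtmlAgilityPack package is referenced; the ZipLinkFinder class and FindZipUrl method are illustrative names, not part of the library:

```csharp
using System;
using HtmlAgilityPack;

static class ZipLinkFinder
{
    // The only part of the URL that stays constant from month to month.
    const string Prefix = "http://nppes.viva-it.com/";

    // Hypothetical helper: returns the first matching ZIP URL, or null if none is found.
    public static string FindZipUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath 1.0 has starts-with() but no ends-with(), so the suffix is checked in C#.
        var anchors = doc.DocumentNode.SelectNodes(
            "//a[starts-with(@href, '" + Prefix + "')]");
        if (anchors == null)
            return null; // SelectNodes returns null when nothing matches

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", null);
            if (href != null && href.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
                return href;
        }
        return null;
    }
}
```

Calling ZipLinkFinder.FindZipUrl(pageHtml) on the downloaded page text then yields the URL to pass to WebClient.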

Oleks
0

If there will only ever be one ZIP linked to on the page, no problem:

Regex re = new Regex(@"http://nppes\.viva-it\.com/[^""'\s>]+\.zip");

string url = re.Match(html).Value; // the matched URL (empty string if there is no match)

Here's a demo.
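
To connect this to the download step the question mentions, here is a rough sketch that fetches a page, extracts the link, and saves the file with WebClient. The NppesDownloader class, its method names, and the pageUrl and targetDirectory parameters are assumptions for illustration. The pattern uses a negated character class so a match cannot run past the closing quote of the attribute.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class NppesDownloader
{
    // Only the host prefix and the .zip suffix are assumed to stay constant.
    static readonly Regex ZipUrl = new Regex(
        @"http://nppes\.viva-it\.com/[^""'\s<>]+\.zip",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the first ZIP URL in the HTML, or null if none.
    public static string ExtractZipUrl(string html)
    {
        Match m = ZipUrl.Match(html);
        return m.Success ? m.Value : null;
    }

    // Hypothetical helper: pageUrl is whichever page lists the monthly file.
    public static void DownloadLatest(string pageUrl, string targetDirectory)
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUrl);
            string zipUrl = ExtractZipUrl(html);
            if (zipUrl == null)
                throw new InvalidOperationException("No ZIP link found on " + pageUrl);

            // Name the local file after the last segment of the URL.
            string fileName = Path.GetFileName(new Uri(zipUrl).LocalPath);
            client.DownloadFile(zipUrl, Path.Combine(targetDirectory, fileName));
        }
    }
}
```

Keeping ExtractZipUrl separate from the network call makes the pattern easy to check against saved copies of the page when the site's markup changes.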

Ry-
0

Here is a raw regex, written in free-spacing mode, that uses a branch reset group, (?| ... ).
The URL is in capture group 2.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?|
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s*     \g{-2} )
      | (?> (?!\s*['"]) \s* () (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>

Not sure if C# can do branch reset. If it can't, this variation works.
The URL is always capture group 2 concatenated with capture group 3, since only one of the two participates in any given match.

<a 
  (?=\s) 
  (?= (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s)
    href \s*=
    (?:
        (?> \s* (['"]) \s* (http://nppes\.viva-it\.com/ (?:(?!\g{-2}) .)+ \.zip ) \s* \g{-2} )
      | (?> (?!\s*['"]) \s* (http://nppes\.viva-it\.com/ [^\s>]* \.zip ) (?=\s|>) )
    )
  )
  \s+ (?:".*?"|'.*?'|[^>]*?)+ 
>
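
As far as I know, .NET's regex engine supports neither branch reset groups nor \g{-n} relative backreferences, so both patterns above would need adapting before use in C#. .NET does allow the same group name to appear in several alternatives, which gives a similar effect. A simplified sketch on that basis: the AnchorZipRegex class and Find method are illustrative names, and the outer lookahead that validates the rest of the tag is omitted.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorZipRegex
{
    // Free-spacing pattern; the name "url" is reused in both alternatives.
    static readonly Regex Href = new Regex(@"
        <a \s                                    # opening anchor tag
        (?: [^>""'] | ""[^""]*"" | '[^']*' )*?   # skip earlier attributes and quoted values
        (?<=\s) href \s* = \s*
        (?:
            (?<q>[""']) \s*
            (?<url> http://nppes\.viva-it\.com/ (?:(?!\k<q>).)+? \.zip )
            \s* \k<q>                            # same quote character closes the value
          |
            (?<url> http://nppes\.viva-it\.com/ [^\s>""']* \.zip ) (?=\s|>)
        )",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the URL from the first matching anchor, or null.
    public static string Find(string html)
    {
        Match m = Href.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

Because both alternatives capture into "url", the caller reads one group regardless of whether the attribute value was quoted.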