1

I have <A HREF="f110111.ZIP">, where f110111 is an arbitrary character sequence. I need a C# regex match expression to extract all such file names.

E.g., the input is:

<A HREF="f110111.ZIP"><A HREF="qqq.ZIP"><A HREF="gygu.ZIP">

I want the list:

  • f110111.ZIP
  • qqq.ZIP
  • gygu.ZIP
skaeff

5 Answers

3

What you need is the Html Agility Pack! It lets you read HTML easily and gives you a simple way to retrieve links.
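As a rough sketch of what that could look like, the snippet below loads the markup with the Html Agility Pack and collects every href that ends in .zip. It assumes the HtmlAgilityPack NuGet package is installed; the class and method names (HapLinks, ExtractZipHrefs) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

public static class HapLinks
{
    // Hypothetical helper: returns the href value of every anchor that points at a .zip file.
    public static List<string> ExtractZipHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return result;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");
            if (href.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            {
                result.Add(href);
            }
        }
        return result;
    }

    public static void Main()
    {
        const string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string href in ExtractZipHrefs(input))
        {
            Console.WriteLine(href);
        }
    }
}
```

The extension check is done in plain C# rather than in the XPath expression to keep the query simple.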

Jaapjan
  • Why would you parse a complete html page when you exactly know what you want? It's a little bit overkill I think for this question. – 321X Apr 20 '11 at 08:42
2

If the filename can contain multiple dots:

<A HREF="([^"]+?)\.zip

If the filename has no dots (just the one before zip), you can use a faster one:

<A HREF="([^".]+)

Match case-insensitively, since the sample input uses .ZIP.

C# example:

// Requires: using System.Text.RegularExpressions;
Regex regex = new Regex("<A HREF=\"([^\"]+?)\\.zip", RegexOptions.IgnoreCase);

foreach (Match match in regex.Matches(buffer)) {
    // do something with: match.Groups[1].Value (the name without the extension)
}
vbence
0

NO NO! Do not use Regex to parse HTML!

Try an XML Parser. Or XPath perhaps.

Ranhiru Jude Cooray
  • No No No. Parsing a full HTML document for this is **Crazy** with capital C. – vbence Apr 20 '11 at 07:56
  • @vbence: True enough :) But the OP did not specify how many links there were. Anyway, you would eventually get frustrated trying to match every possible scenario using RegEx. – Ranhiru Jude Cooray Apr 20 '11 at 07:59
  • I can think of situations where using Regular Expressions would be more robust than using a DOM tree (eg, if the links are not in uniform locations). This is exactly what Regex was built for. Use the right tool for the right job. – Mike Caron Apr 20 '11 at 08:08
  • @Mike Caron HTML is basically a tree structure, though. When you want to process it in a hierarchy-aware way, RegEx won't be much help. This question does not seem to need that, which is why I vouch for RegEx. – vbence Apr 20 '11 at 08:14
0

Try this one:

/<a href="([^">]+\.ZIP)/gi
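The pattern above is written as a regex literal with /gi flags, which C# does not have. A minimal sketch of one way to express it in C#, with the dot escaped so it matches a literal period: RegexOptions.IgnoreCase stands in for the i flag, and Regex.Matches already returns every match, which covers g. The class and method names (ZipLinks, Extract) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ZipLinks
{
    // Hypothetical helper: returns the captured file name from every <a href="....ZIP"> tag.
    public static List<string> Extract(string html)
    {
        var names = new List<string>();
        MatchCollection matches = Regex.Matches(
            html,
            "<a href=\"([^\">]+\\.ZIP)",   // group 1 = file name including the extension
            RegexOptions.IgnoreCase);

        foreach (Match match in matches)
        {
            names.Add(match.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        const string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        // Prints: f110111.ZIP, qqq.ZIP, gygu.ZIP
        Console.WriteLine(string.Join(", ", Extract(input)));
    }
}
```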
jerone
0

I think regular expressions are a great way to extract pieces of text from a given input.

This regex gets the File, Filename and Extension from the given text.

href="(?<File>(?<Filename>.*?)(?<Ext>\.\w{1,3}))"

The regex above expects an extension consisting of 1 to 3 word characters (a-z, A-Z, 0-9, or underscore).

C# Code sample:

string regex = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
RegexOptions options = ((RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline) | RegexOptions.IgnoreCase);
Regex reg = new Regex(regex, options);
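The sample stops after constructing the Regex. A hedged sketch of how the named groups might then be read: it uses the same pattern with only RegexOptions.IgnoreCase, and the class and method names (NamedGroupDemo, Parse) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical helper: returns { File, Filename, Ext } for every href found in the input.
    public static List<string[]> Parse(string html)
    {
        var reg = new Regex(
            "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"",
            RegexOptions.IgnoreCase);

        var parts = new List<string[]>();
        foreach (Match match in reg.Matches(html))
        {
            parts.Add(new[]
            {
                match.Groups["File"].Value,      // e.g. the whole "name.ext"
                match.Groups["Filename"].Value,  // the part before the last dot
                match.Groups["Ext"].Value        // the dot plus 1 to 3 word characters
            });
        }
        return parts;
    }

    public static void Main()
    {
        const string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string[] p in Parse(input))
        {
            Console.WriteLine("{0} = {1} + {2}", p[0], p[1], p[2]);
        }
    }
}
```

Groups are looked up by name through match.Groups["..."], so the order of the groups in the pattern does not matter to the calling code.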
321X