Help with Regex. Need to extract `

Question

I have <A HREF="f110111.ZIP">, where f110111 is an arbitrary character sequence. I need a C# regex match expression to extract every such file name.

E.g. the input is

<A HREF="f110111.ZIP"><A HREF="qqq.ZIP"><A HREF="gygu.ZIP">

I want the list:

f110111.ZIP
qqq.ZIP
gygu.ZIP

http://stackoverflow.com/a/1732454/62576 – Ken White Apr 07 '16 at 03:11

score 3 · Answer 1 · answered Apr 20 '11 at 07:52

What you need is the Html Agility Pack! It lets you read HTML easily and gives you a simple way to retrieve links.

Jaapjan

Why would you parse a complete HTML page when you know exactly what you want? I think it's a bit of overkill for this question. – 321X Apr 20 '11 at 08:42

score 2 · Answer 2 · answered Apr 20 '11 at 07:55

If you can have multiple dots in the filename:

<A HREF="([^"]+?)\.zip

If you do not have dots in the filename (just one before the zip), you can use a faster one:

<A HREF="([^".]+)

C# example:

// requires: using System.Text.RegularExpressions;
Regex regex = new Regex("<A HREF=\"([^\"]+?)\\.zip", RegexOptions.IgnoreCase);

foreach (Match match in regex.Matches(buffer)) {
    // do something with: match.Groups[1].Value (the name without ".zip")
}
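
For reference, here is a minimal self-contained sketch of this approach in C#, run against the sample input from the question. The class and method names (ZipLinkExtractor, Extract) are made up for illustration, and moving the extension inside the capture group is an assumption based on the output the question asks for (f110111.ZIP rather than f110111).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ZipLinkExtractor
{
    // [^"] is a negated character class: it stops at the closing quote.
    // The dot before "zip" is escaped so it matches a literal '.'.
    // IgnoreCase lets the pattern match both ".zip" and ".ZIP".
    private static readonly Regex LinkPattern =
        new Regex("<A HREF=\"([^\"]+?\\.zip)\"", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every captured file name, in order.
    public static List<string> Extract(string buffer)
    {
        var names = new List<string>();
        foreach (Match match in LinkPattern.Matches(buffer))
        {
            // Group 1 holds the file name including its extension.
            names.Add(match.Groups[1].Value);
        }
        return names;
    }
}
```

Calling ZipLinkExtractor.Extract on the question's input should yield the three entries f110111.ZIP, qqq.ZIP and gygu.ZIP.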

edited Apr 20 '11 at 08:03

answered Apr 20 '11 at 07:55

vbence

score 0 · Answer 3 · edited May 23 '17 at 12:31

NO NO! Do not use Regex to parse HTML!

Try an XML Parser. Or XPath perhaps.

edited May 23 '17 at 12:31 by Community


answered Apr 20 '11 at 07:50

Ranhiru Jude Cooray

No No No. Parsing a full HTML document for this is **Crazy** with capital C. – vbence Apr 20 '11 at 07:56
@vbence: True enough :) But the OP did not specify how many links there were. Anyway, you would eventually get frustrated trying to match all the possible scenarios using RegEx. – Ranhiru Jude Cooray Apr 20 '11 at 07:59
I can think of situations where using Regular Expressions would be more robust than using a DOM tree (eg, if the links are not in uniform locations). This is exactly what Regex was built for. Use the right tool for the right job. – Mike Caron Apr 20 '11 at 08:08
@Mike Caron: HTML is basically a tree structure, though. When you want to process it in a hierarchy-aware way, RegEx won't be much help. This question does not seem to need that, which is why I vouch for RegEx. – vbence Apr 20 '11 at 08:14

score 0 · Answer 4 · answered Apr 20 '11 at 07:57

Try this one:

/<a href="([^">]+\.ZIP)/gi
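
The /.../gi notation is a regex literal with flags, as used in languages such as JavaScript; C# has no such literal. A rough C# equivalent might look like the sketch below, assuming the i flag maps to RegexOptions.IgnoreCase and the g flag to Regex.Matches, which returns all matches. The dot before ZIP is escaped so that it only matches a literal period. HrefZipFinder and FindAll are illustrative names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefZipFinder
{
    // Hypothetical helper: collects the href value of every .ZIP link.
    public static List<string> FindAll(string html)
    {
        var results = new List<string>();
        // Regex.Matches returns every match (the role of the "g" flag);
        // RegexOptions.IgnoreCase plays the role of the "i" flag.
        MatchCollection matches = Regex.Matches(
            html, "<a href=\"([^\">]+\\.ZIP)", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```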

jerone

score 0 · Answer 5 · answered Apr 20 '11 at 08:39

I think Regular Expressions are a great way to filter text out of a given text.

This regex gets the File, Filename and Extension from the given text.

href="(?<File>(?<Filename>.*?)(?<Ext>\.\w{1,3}))"

The regex above expects an extension made up of 1 to 3 word characters (a-z, A-Z, 0-9 and the underscore).

C# Code sample:

string regex = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
RegexOptions options = ((RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline) | RegexOptions.IgnoreCase);
Regex reg = new Regex(regex, options);
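
The sample above only constructs the Regex. As a hedged sketch of how the named groups could then be read, the helper below collects File, Filename and Ext for every match; NamedGroupDemo and Split are illustrative names, and only RegexOptions.IgnoreCase is kept, on the assumption that the other options are not needed for this pattern.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Hypothetical helper: one string[] { File, Filename, Ext } per link.
    public static List<string[]> Split(string text)
    {
        string pattern = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
        var rows = new List<string[]>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
        {
            // Named groups are looked up by the name given in (?<Name>...).
            rows.Add(new[]
            {
                m.Groups["File"].Value,
                m.Groups["Filename"].Value,
                m.Groups["Ext"].Value
            });
        }
        return rows;
    }
}
```

For the input <A HREF="f110111.ZIP"> this should give File = f110111.ZIP, Filename = f110111 and Ext = .ZIP.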
