RegEx to pull out specific URL format from HTML source

Question

I'm having trouble writing a RegEx to pull out a specifically formatted HTML link from a page's HTML source.

The HTML source contains many of these links. The link is in the format:

<a class="link" href="pagedetail.html?record_id=123456">RecordName</a>

For each matching link, I would like to be able to easily extract the following two bits of information:

The URL bit. E.g. pagedetail.html?record_id=123456
The link name. E.g. RecordName

Can anyone please help with this? I'm completely stuck. I need this for a C# program, so any C#-specific notation would be great. Thanks

TIA

Use [HAP](http://htmlagilitypack.codeplex.com/) for this, not regex — DGibbs, Sep 12 '14 at 08:57
Related question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Baldrick, Sep 12 '14 at 08:58

score 0 · Answer 1 · edited May 23 '17 at 12:21

I feel a bit silly answering this, because it should be evident through the two comments to your question, but...

You should not parse HTML with REGEX!

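Use a dedicated HTML parser instead, such as the HTML Agility Pack. A plain XML parser only works if the page happens to be well-formed XML, whereas the HTML Agility Pack tolerates real-world HTML and still gives you XML-style querying such as XPath.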
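
As a rough sketch of the parser route, something like the following should pull out the same two values the question asks for. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HapLinkExtractor and the XPath expression are illustrative choices, not something taken from the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper, named here for illustration only.
public static class HapLinkExtractor
{
    // Returns one (href, link text) pair for every <a class="link" href="..."> element.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@class='link'][@href]");
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            string url = node.GetAttributeValue("href", string.Empty);
            string name = node.InnerText;
            results.Add(new KeyValuePair<string, string>(url, name));
        }

        return results;
    }
}
```

Calling HapLinkExtractor.Extract(html) on the page source would then give you each URL in Key and each link name in Value. Unlike a regex, this should keep working if the attribute order or the whitespace inside the tag changes.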

answered Sep 12 '14 at 09:16

MBender

However, if you're writing a one-off hacky data-scraping script, with a known set of consistently formatted input files, and certainty that it won't ever be used on a production system... then.. maybe.. sometimes even a bad approach can have a practical use! ..(runs away and hides!) – Baldrick Sep 12 '14 at 09:24

@Baldrick It's always a matter of balance and finding the "sweet spot" between cutting corners and being a stiff formalist. ;) – MBender Sep 12 '14 at 13:18

score 0 · Answer 2 · answered Sep 12 '14 at 09:25

You can use the TagRegex and EndTagRegex classes to parse the HTML string and find the tag you want. You need to step through the HTML string one character position at a time and test for the desired tag at each position.

e.g.

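// Requires: using System.Web.RegularExpressions; (.NET Framework, System.Web.RegularExpressions.dll)
var position = 0;
var tagRegex = new TagRegex();
var endTagRegex = new EndTagRegex();

while (position < html.Length)
{
    // Try to match an opening tag at the current position.
    var match = tagRegex.Match(html, position);

    if (match.Success)
    {
        var tagName = match.Groups["tagname"].Value;
        if (tagName == "a") 
        { ... }
    }
    else
    {
        // No opening tag here, so try a closing tag and read the name from that match.
        var endMatch = endTagRegex.Match(html, position);
        if (endMatch.Success)
        {
            var tagName = endMatch.Groups["tagname"].Value;
            if (tagName == "a") 
            { ... }
        }
    }
    position++;
}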

score 0 · Accepted Answer · answered Sep 12 '14 at 09:27

People will tell you that you should not parse HTML with regex, and I think that is a valid statement.

But sometimes, with well-formatted HTML and a really simple case like yours seems to be, you can use a regex to do the job.

For example, you can use this regex and read group 1 for the URL and group 2 for the RecordName:

<a class="link" href="([^"]+)">([^<]+)<
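
Since you asked for C# notation, here is a minimal sketch of how that pattern could be applied with System.Text.RegularExpressions. The class name LinkExtractor and its Extract method are made up for illustration, and the code assumes every link on the page is written exactly in the format shown in the question (same attribute order, double quotes, single spaces).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class LinkExtractor
{
    // Group 1 captures the href value, group 2 captures the link text.
    private static readonly Regex LinkPattern =
        new Regex("<a class=\"link\" href=\"([^\"]+)\">([^<]+)<");

    // Returns one (url, name) pair per matching link, in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        foreach (Match m in LinkPattern.Matches(html))
        {
            string url = m.Groups[1].Value;
            string name = m.Groups[2].Value;
            results.Add(new KeyValuePair<string, string>(url, name));
        }

        return results;
    }
}
```

You would call LinkExtractor.Extract(html) and loop over the result, reading Key for the URL and Value for the link name. Note that both values come back exactly as they appear in the source, so any HTML entities in them (for example &amp;amp; in a query string) are not decoded.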
