regular expression for finding 'href' value of a link

Question

I need a regex pattern for finding web page links in HTML.

I first use @"(<a.*?>.*?</a>)" to extract the links (<a> elements), but I can't get the href value from that.

My strings are:

<a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="www.example.com/page.php/404" ....></a>

Strings 1, 2 and 3 are valid and I need them, but string 4 is not valid for me (the ? and = are essential).

Update: I don't need to parse <a>. I have a list of links in href="abcdef" format.

I need to fetch the href of each link and filter it: the URLs I want must contain ? and =, like page.php?id=5.

read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Bartlomiej Lewandowski, Apr 10 '13 at 12:43
The HtmlAgility nuget package is what I would suggest using. — asawyer, Apr 10 '13 at 12:43
You may want to check out [CsQuery](https://github.com/jamietre/CsQuery). Similar to jQuery, it allows you to select tags and extract attributes and such. Regular expressions tend to get tricky when applied to raw HTML. — rudolph9, Apr 10 '13 at 12:44
Hi, please check this: http://stackoverflow.com/questions/2450985/regex-expression-to-find-a-href-links-and-add-nofollow-to-them — user1102001, Apr 10 '13 at 12:46
Don't use regex for HTML, as explained in http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. — L-Four, Apr 10 '13 at 12:52
Thanks all, but I don't need to parse `<a>`. I have a list of links in "href=xxxxxxxx" format. I need to fetch the xxxxx of each link and filter it: the xxxxx I want must contain '?' and '=', like `xxxx.php?id=5`. Thanks — MrRolling, Apr 16 '13 at 04:35

plalx · Accepted Answer · 2019-12-04T21:05:02.910

105

I'd recommend using an HTML parser over a regex, but here is a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

You can view a full explanation of this regex here.

Snippet playground:

const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');

document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});

<label>
  Text to match:
  <input type="text" name="textToMatch" value='<a href="google.com"'>
  
  <button>Match</button>
 </label>
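The playground above calls match without the g flag, so it reports only the first link. As a minimal sketch of collecting every href value, assuming a runtime that supports String.prototype.matchAll, the same pattern can be used with the g flag. The helper name extractHrefs and the sample string are made up for illustration.

```javascript
// Same pattern as in the answer, plus the g flag so matchAll can walk every link.
const allLinksRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/g;

// Hypothetical helper: returns the href values in document order.
function extractHrefs(html) {
  // Group 1 is the quote character, group 2 is the attribute value.
  return Array.from(html.matchAll(allLinksRx), (m) => m[2]);
}

const sampleHtml =
  '<a href="http://www.example.com/page.php?id=1&name=a">one</a> ' +
  "<a class='x' href='www.example.com/page.php/404'>two</a>";

console.log(extractHrefs(sampleHtml));
```

The backreference \1 is what lets one pattern handle both quoting styles: whichever quote opens the value must also close it.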

edited Dec 04 '19 at 21:05

answered Apr 10 '13 at 12:49

plalx

Nice one. I got a bit lost with the `\s+)?`. Maybe this would be simpler or maybe I'm missing something? `<a[^>]*?href="([^"]*)"` – Mosty Mostacho Jan 13 '14 at 15:21
@MostyMostacho The additional `\s+` is necessary, otherwise ` – plalx Jan 13 '14 at 15:43
@plalx note that this does not match `href='//example.com'` and `href=//example.com` but only the contents in between double quotes. Unfortunately some people are still using one of those two options (and browsers accept those). How would the regex be if those two are included also? – Bob Ortiz Jul 08 '16 at 17:02
I don't understand what (?:[^>]*?\s+)? means. Can you explain it? Thanks – Mervyn Mar 23 '17 at 07:33
@EvanderConsus I have fixed the regex to support both quoting types. As long as backreferences are supported in C# it will work. – plalx Mar 23 '17 at 14:26
@YangMingYuan I added a link to a full explanation of the regex. – plalx Mar 23 '17 at 14:26
@plalx I'm having an issue with attribute values containing the > char. `[^>]` has to be replaced with some other expression. Do you know what it should be? Maybe you can update the solution. – Arek Kostrzeba Dec 03 '21 at 10:09

score 12 · Answer 2 · edited Nov 26 '17 at 13:20


Using regex to parse HTML is not recommended.

Regex is meant for regular patterns, and HTML is not regular in its format (XHTML excepted). For example, an HTML file can be valid even when a closing tag is missing, which could break your code.

Use an HTML parser such as HtmlAgilityPack.

You can use this code to retrieve every href in the anchor tags using HtmlAgilityPack:

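// Needs: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// SelectNodes returns null when no <a> element is found, so guard against that.
var anchors = doc.DocumentNode.SelectNodes("//a");
var hrefList = anchors == null
    ? new List<string>()
    : anchors.Select(p => p.GetAttributeValue("href", "not found")).ToList();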

hrefList contains all the href values.

edited Nov 26 '17 at 13:20

carla


answered Apr 10 '13 at 12:57

Anirudha


In this case, the problem is so specific that using a regex is safe enough. We can safely discard anything that is not in the form of `<a[^>]* href="([^"]*)"`, unless you want to handle single-quoted attributes, but that's an easy fix. – plalx Apr 10 '13 at 13:15
@plalx Yes indeed, but it's not recommended. Performance would become a bottleneck if the HTML is huge, and in many cases it is – Anirudha Apr 10 '13 at 13:30
Have you tested the performance? I am quite sure that a regex will be faster than using a parser that will have to generate a whole DOM tree. – plalx Oct 23 '13 at 17:54
Perhaps you should read the conversation again? =P You agreed with me then brought up the performance factor. Now you disagree AND you contradict yourself by saying performance is not important. I'm having a little trouble following hehe ;) – plalx Oct 23 '13 at 18:25
I think parsing HTML with regex can be justifiable when the searched pattern is simple and a large number of HTML documents must be processed at once. Even then, it should be seen as an aggressive optimization and only implemented once it's determined that parsing is posing a performance issue. – lampyridae May 12 '16 at 14:40
If you're parsing just a link as a string, say from a database, and not an entire document, regex is much faster. From experience: I had A tags, 40,000 of them, and HtmlAgilityPack has way too much overhead for such a simple thing (20 min vs 2 min). – MC9000 Apr 05 '21 at 14:15

score 12 · Answer 3 · edited Jan 30 '21 at 12:36

Thanks everyone (especially @plalx)

I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern while a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"

My final regex string:

First, use one of these:

st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&amp;%\$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";

My choice is:

@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

Second, use this:

st = "(.*)?(.*)=(.*)";

Problem solved. Thanks, everyone :)
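An unescaped ? in a regex is a quantifier, so (.*)?(.*)=(.*) only requires an = somewhere in the string; to require a literal question mark it has to be written as \?. The sketch below shows the two-step idea (extract the href value, then keep only values where a ? is followed by an =) in JavaScript, the language of the playground snippet in the accepted answer. hrefValue and hasQuery are hypothetical helper names, and the inputs are the four strings from the question reduced to their href="..." part.

```javascript
// Hypothetical helper: pulls the value out of an href="..." or href='...' string.
function hrefValue(text) {
  const m = /href=(["'])(.*?)\1/.exec(text);
  return m ? m[2] : null;
}

// True when the URL contains a literal ? that is followed, later on, by an =.
function hasQuery(url) {
  return /\?.*=/.test(url);
}

const links = [
  'href="www.example.com/page.php?id=xxxx&name=yyyy"',
  'href="http://www.example.com/page.php?id=xxxx&name=yyyy"',
  'href="https://www.example.com/page.php?id=xxxx&name=yyyy"',
  'href="www.example.com/page.php/404"',
];

// Step 1: extract each value. Step 2: keep only those with a query string.
const wanted = links
  .map(hrefValue)
  .filter((url) => url !== null && hasQuery(url));

console.log(wanted);
```

With these inputs the first three URLs are kept and the /404 one is dropped, which is the filtering the question asks for.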

Please consider changing your selected answer. No one will or should ever use such a complex regex for that simple task. — plalx, Feb 21 '14 at 12:56
@plalx Why? Unfortunately I don't work with regex any more, but last year that regex worked for me. If you are sure that you have a better alternative, tell me and I'll mark it as the best answer. Thanks ;) — MrRolling, Feb 26 '14 at 18:28
Well the fact is that even if you had the smartest regular expression in the world that can validate that the `href` content is actually a URL, you cannot assert it's a valid URL since it might not exist at all. Therefore, I find it quite overkill to enforce the validity of the `href` attribute with such a complex and cryptic pattern while a simple expression such as `<a\s+(?:[^>]*?\s+)?href="([^"]*)"` would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use `<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"` — plalx, Feb 26 '14 at 19:43
... why not just mark plalx's answer accepted instead if you think it's better? You're essentially duplicating content here, and that's not something we want to encourage. — BoltClock, May 06 '15 at 18:31

KF2 · Answer 4 · 2013-04-16T04:40:25.940

Try this :

 public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            var res = Find(html);
        }

        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }
            return list;
        }

        public struct LinkItem
        {
            public string Href;
            public string Text;

            public override string ToString()
            {
                return Href + "\n\t" + Text;
            }
        }

    }

Input:

  string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";

Result:

[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}

C# Scraping HTML Links

Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.

Edited

Another easy way: you can use a WebBrowser control to get the href from each <a> tag, like this (see my example):

 public Form1()
        {
            InitializeComponent();
            webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
        }

        void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            List<string> href = new List<string>();
            foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
            {
                href.Add(el.GetAttribute("href"));
            }
        }

An anchor tag without a closing tag is **valid**, so in that case your code would **break** or not work. It's better to use an HTML parser — Anirudha, Apr 10 '13 at 13:04
@:The_Land_Of_Devils_SriLanka: An **HTML parser** is better for dynamic content. You are right. — KF2, Apr 10 '13 at 13:08
I used regex twice: first with the one from this post, and second with this: "(.*)?(.*)=(.*)" — MrRolling, Apr 21 '13 at 06:04

score 4 · Answer 5 · edited May 23 '17 at 12:10


Try this regex:

"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"

You will get more help from discussions over:

Regular expression to extract URL from an HTML link

and

Regex to get the link in href. [asp.net]

Hope it's helpful.
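The regex in this answer tries a double-quoted value first and falls back to an unquoted one. As a rough sketch of the same alternation idea in JavaScript, the variation below adds a single-quoted branch and stops an unquoted value at whitespace or >. JavaScript has traditionally not allowed the same group name twice, so it uses three numbered groups instead of the repeated (?<1>...). The names hrefAnyQuoteRx and hrefValues and the sample markup are assumptions for illustration.

```javascript
// Group 1: double-quoted value, group 2: single-quoted value, group 3: unquoted value.
const hrefAnyQuoteRx = /href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g;

// Hypothetical helper: returns whichever of the three groups took part in each match.
function hrefValues(html) {
  return Array.from(html.matchAll(hrefAnyQuoteRx), (m) => m[1] ?? m[2] ?? m[3]);
}

const mixedQuotes =
  '<a href="a.php?id=1">x</a>' +
  "<a href='b.php?id=2'>y</a>" +
  "<a href=//example.com/c?id=3>z</a>";

console.log(hrefValues(mixedQuotes));
```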

edited May 23 '17 at 12:10

Community


answered Apr 10 '13 at 12:45

Freelancer


score 4 · Answer 6 · answered Dec 09 '21 at 08:32


I took a much simpler approach. This one simply looks for href attributes and captures the value that follows (between single or double quotes) into a group named url:

href=['"](?<url>.*?)['"]
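One thing to be aware of with this pattern is that the closing ['"] accepts either quote character, so a double-quoted value that contains an apostrophe is cut short. The sketch below, in JavaScript (which supports the same (?<name>...) syntax), compares it with a variant that uses a named backreference to require the matching closing quote. The names simpleRx and pairedRx, the group name q, and the sample tag are made up for illustration.

```javascript
// The pattern from the answer: any quote opens, any quote closes.
const simpleRx = /href=['"](?<url>.*?)['"]/;
// Variant: remember the opening quote as q and require the same one to close.
const pairedRx = /href=(?<q>['"])(?<url>.*?)\k<q>/;

const tagWithApostrophe = `<a href="/o'brien.php?id=1">profile</a>`;

// The simple pattern stops at the apostrophe inside the value.
console.log(simpleRx.exec(tagWithApostrophe).groups.url);
// The paired pattern reads up to the closing double quote.
console.log(pairedRx.exec(tagWithApostrophe).groups.url);
```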

answered Dec 09 '21 at 08:32

Daan van den Bergh


score 3 · Answer 7 · answered Dec 02 '15 at 09:31


 HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
 public IHTMLAnchorElement imageElementHref;
 imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;

Simply try this code

answered Dec 02 '15 at 09:31

Joee


score 3 · Answer 8 · answered May 10 '16 at 15:56

I came up with this one, that supports anchor and image tags, and supports single and double quotes.

<[a|img]+\\s+(?:[^>]*?\\s+)?[src|href]+=[\"']([^\"']*)['\"]

So

<a href="/something.ext">click here</a>

Will match:

 Match 1: /something.ext

And

<a href='/something.ext'>click here</a>

Will match:

 Match 1: /something.ext

Same goes for img src attributes
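A note on the syntax used here: square brackets define a character class, so [a|img]+ matches any run of the characters a, |, i, m and g, and [src|href]+ behaves the same way, which means the pattern also accepts names that are neither a/img nor src/href. A non-capturing group with | is the usual way to write "one of these words". The sketch below contrasts the two in JavaScript; looseRx is the pattern from this answer with the string escaping removed, while strictRx and the sample strings are assumptions for illustration.

```javascript
// Character classes: any mix of the listed characters is accepted.
const looseRx = /<[a|img]+\s+(?:[^>]*?\s+)?[src|href]+=["']([^"']*)["']/;
// Alternation: only the exact tag names and attribute names are accepted.
const strictRx = /<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=["']([^"']*)["']/;

const anchorTag = '<a href="/something.ext">click here</a>';
const bogusTag = '<mag sch="/not-a-link">';

// Both patterns capture the value from a real anchor tag.
console.log(looseRx.exec(anchorTag)[1], strictRx.exec(anchorTag)[1]);
// Only the character-class version accepts the made-up tag.
console.log(looseRx.test(bogusTag), strictRx.test(bogusTag));
```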

score 0 · Answer 9 · answered Feb 25 '22 at 11:21

I think in this case it is one of the simplest preg_match patterns:

/<a\s*(.*?id[^"]*")/g

It gets links that have the variable id in the address.

The capture starts at href (inclusive), takes every character (. matches anything except a newline) up to and including the first occurrence of id, and then everything up to the next " character ([^"]*).

score 0 · Answer 10 · answered May 27 '23 at 15:12

(?<=href=")(.*?)(?=")

None of the other answers actually select the VALUE of the href so in my eyes they are all incorrect. See here for a full breakdown that's better than anything I could type here. https://regexr.com/7egrc

Be aware that this does not work in older browsers. It does work in all modern browsers. See the full list here. https://caniuse.com/js-regexp-lookbehind
