59

I need a regex pattern for finding web page links in HTML.

I first use @"(<a.*?>.*?</a>)" to extract links (<a>), but I can't fetch href from that.

My strings are:

  1. <a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  2. <a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  3. <a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
  4. <a href="www.example.com/page.php/404" ....></a>

1, 2 and 3 are valid and I need them, but number 4 is not valid for me (? and = is essential)


Update: I don't need parsing <a>. I have a list of links in href="abcdef" format.

I need to fetch href of the links and filter it, my favorite urls must be contain ? and = like page.php?id=5

MrRolling

10 Answers

105

I'd recommend using an HTML parser over a regex, but here's a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used.

<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

You can view a full explanation of this regex here.

Snippet playground:

const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');

document.querySelector('button').addEventListener('click', () => {
  console.log(textToMatchInput.value.match(linkRx));
});
<label>
  Text to match:
  <input type="text" name="textToMatch" value='<a href="google.com"'>
  
  <button>Match</button>
 </label>
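
To collect every href rather than just the first, the same pattern can be run with the `g` flag. This is only a sketch: `sampleHtml` is made-up input, and the `<abbr>` line is there to show why the `\s+` after `<a` matters.

```javascript
// The pattern from this answer, with the global flag added so matchAll can be used.
const hrefRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/g;

// Hypothetical sample input (not from the question).
const sampleHtml = [
  '<a href="http://www.example.com/page.php?id=1&name=x" class="c"></a>',
  "<a class='c' href='www.example.com/page.php/404'></a>",
  '<abbr href="not-an-anchor"></abbr>', // skipped: "<a" must be followed by whitespace
].join("\n");

// Group 1 is the opening quote, group 2 is the attribute value.
const hrefs = Array.from(sampleHtml.matchAll(hrefRx), (m) => m[2]);
console.log(hrefs);
```

Group 2 holds the URL because group 1 is used to remember which quote character opened the attribute.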
plalx
  • Nice one. I got a bit lost with the `\s+)?`. Maybe this would be simpler or maybe I'm missing something? `<a[^>]*?href="([^"]*)"` – Mosty Mostacho Jan 13 '14 at 15:21
  • 1
    @MostyMostacho The additional `\s+` is necessary, otherwise ` – plalx Jan 13 '14 at 15:43
  • @plalx note that this does not match `href='//example.com'` and `href=//example.com` but only the contents in between double quotes. Unfortunately some people are still using one of those two options (and browsers accept those). How would the regex be if those two are included also? – Bob Ortiz Jul 08 '16 at 17:02
  • I don't understand what `(?:[^>]*?\s+)?` means. Can you explain it? Thanks. – Mervyn Mar 23 '17 at 07:33
  • @EvanderConsus I have fixed the regex to support both quoting types. As long as backreferences are supported in C# it will work. – plalx Mar 23 '17 at 14:26
  • @YangMingYuan I added a link to a full explanation of the regex. – plalx Mar 23 '17 at 14:26
  • @plalx I'm having an issue with attribute values containing the > char. `[^>]` has to be replaced with some other expression. Do you know what it should be? Maybe you can update the solution. – Arek Kostrzeba Dec 03 '21 at 10:09
12

Using regex to parse HTML is not recommended

Regex is meant for regularly occurring patterns, and HTML is not regular in its format (XHTML excepted). For example, HTML files are valid even if a closing tag is missing! This could break your code.

Use an HTML parser like HtmlAgilityPack.

You can use this code to retrieve every href in the anchor tags using HtmlAgilityPack:

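using System.Linq;
using HtmlAgilityPack;

// Load the document from a stream (yourStream is your own input).
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// Select every <a> element and read its href; "not found" is the fallback value.
var hrefList = doc.DocumentNode.SelectNodes("//a")
                  .Select(p => p.GetAttributeValue("href", "not found"))
                  .ToList();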

hrefList contains all the hrefs.

carla
Anirudha
  • 2
    In this case, the problem is so specific that using a regex is safe enough. We can safely discard anything that is not in the form of `<a[^>]* href="([^"]*)"`, unless you want to handle single-quoted attributes, but that's an easy fix. – plalx Apr 10 '13 at 13:15
  • 1
    @plalx Yes indeed, but it's not recommended. Performance would become a bottleneck if the HTML is huge, and in many cases it is. – Anirudha Apr 10 '13 at 13:30
  • 4
    Have you tested performances? I am quite sure that a regex will be faster than using a parser that will have to generate a whole DOM tree. – plalx Oct 23 '13 at 17:54
  • 1
    Perhaps you should read the conversation again? =P You agreed with me then brought up the performance factor. Now you disagree AND you contradict yourself by saying performance is not important. I'm having a little trouble to follow hehe ;) – plalx Oct 23 '13 at 18:25
  • I think parsing HTML with regex can be justifiable when the searched pattern is simple and a large number of HTML documents must be processed at once. Even then, it should be seen as an aggressive optimization and only implemented once it's determined that parsing is posing a performance issue. – lampyridae May 12 '16 at 14:40
  • If you're parsing just a link as a string, say from a database, and not an entire document, regex is much faster. From experience: I had A tags, 40,000 of them, and HtmlAgilityPack has way too much overhead for such a simple thing (20 min vs 2 min). – MC9000 Apr 05 '21 at 14:15
12

Thanks, everyone (especially @plalx)

I find it quite overkill to enforce the validity of the href attribute with such a complex and cryptic pattern when a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
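
As a quick check, the query-string variant can be tried on the four links from the question. This is a sketch: the links are adapted from the question with the elided attributes dropped, and the variable names are made up. Note that this pattern only requires a `?` in the value, not an `=`.

```javascript
// The query-string variant quoted above, with the global flag added.
const queryHrefRx = /<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"/g;

// The four links from the question, trimmed to the href attribute.
const questionLinks = [
  '<a href="www.example.com/page.php?id=xxxx&name=yyyy"></a>',
  '<a href="http://www.example.com/page.php?id=xxxx&name=yyyy"></a>',
  '<a href="https://www.example.com/page.php?id=xxxx&name=yyyy"></a>',
  '<a href="www.example.com/page.php/404"></a>',
].join("\n");

// Group 1 is the href value; the fourth link has no "?" and is skipped.
const withQuery = Array.from(questionLinks.matchAll(queryHrefRx), (m) => m[1]);
console.log(withQuery);
```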


My final regex string:


First, use one of these:
st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+ \w\d:#@%/;$()~_?\+-=\\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:#@%/;$()~_?\+,\-=\\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&amp;%\$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";

My choice is:

@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"

Second, use this:

st = @"(.*)\?(.*)=(.*)";
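
One thing to watch in this second step: an unescaped `?` in a regex is a quantifier, so a literal question mark has to be written as `\?`. Here is a small sketch of the difference, plus a filter that keeps only URLs containing a `?` followed later by an `=`. The `candidates` list and all names are made up for illustration.

```javascript
// With an unescaped "?", the first group is merely optional, so only "=" is required.
const unescapedRx = /(.*)?(.*)=(.*)/;
// With "\?", a literal question mark must come before the "=".
const escapedRx = /(.*)\?(.*)=(.*)/;

const noQuestionMark = "page.php/a=b";
const unescapedResult = unescapedRx.test(noQuestionMark);
const escapedResult = escapedRx.test(noQuestionMark);

// A shorter filter: "?" followed later by "=", staying inside the query part.
const hasQueryPair = /\?[^#]*=/;
const candidates = [
  "www.example.com/page.php?id=xxxx&name=yyyy",
  "http://www.example.com/page.php?id=xxxx&name=yyyy",
  "www.example.com/page.php/404",
];
const kept = candidates.filter((u) => hasQueryPair.test(u));
console.log(unescapedResult, escapedResult, kept);
```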


Problem solved. Thanks, everyone :)

Gray Programmerz
MrRolling
7

Try this:

 public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            var res = Find(html);
        }

        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
                i.Text = t;

                list.Add(i);
            }
            return list;
        }

        public struct LinkItem
        {
            public string Href;
            public string Text;

            public override string ToString()
            {
                return Href + "\n\t" + Text;
            }
        }

    }  

Input:

  string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> "; 

Result:

[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}

C# Scraping HTML Links

Scraping HTML extracts important page elements. It has many legal uses for webmasters and ASP.NET developers. With the Regex type and WebClient, we implement screen scraping for HTML.

Edited

Another easy way: you can use a WebBrowser control to get the href from each a tag, like this (see my example):

 public Form1()
        {
            InitializeComponent();
            webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
        }

        void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            List<string> href = new List<string>();
            foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
            {
                href.Add(el.GetAttribute("href"));
            }
        }
KF2
  • An anchor tag without a closing tag is **valid**. So in that case your code would **break** or wouldn't work. It's better to use an HTML parser. – Anirudha Apr 10 '13 at 13:04
  • @:The_Land_Of_Devils_SriLanka: An **HTML parser** is better for dynamic content. You are right. – KF2 Apr 10 '13 at 13:08
  • I used regex twice: first with the pattern from this post, and then with "(.*)?(.*)=(.*)" – MrRolling Apr 21 '13 at 06:04
4

Try this regex:

"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"

You will find more help in these discussions:

Regular expression to extract URL from an HTML link

and

Regex to get the link in href. [asp.net]

Hope it's helpful.
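
The `(?<1>...)` syntax, which reuses the same group number in both alternatives, is as far as I know specific to .NET. A rough JavaScript sketch of the same idea, also covering single-quoted and unquoted values, could look like the following. `mixedHtml` and the other names are made up for illustration.

```javascript
// Three alternatives: double-quoted, single-quoted, or unquoted value.
const anyQuoteRx = /href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g;

// Hypothetical input mixing the three quoting styles.
const mixedHtml = [
  '<a href="a.html">one</a>',
  "<a href='b.html'>two</a>",
  "<a href=//example.com>three</a>",
].join("\n");

// Only one of the three groups participates in each match; take whichever is set.
const anyQuoteHrefs = Array.from(
  mixedHtml.matchAll(anyQuoteRx),
  (m) => m[1] ?? m[2] ?? m[3]
);
console.log(anyQuoteHrefs);
```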

Community
Freelancer
4

I took a much simpler approach. This one simply looks for href attributes and captures the value that follows (between single or double quotes) into a group named url:

href=['"](?<url>.*?)['"]
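
One caveat with this pattern is that the closing quote does not have to be the same character as the opening one, so a value containing the other kind of quote gets cut short. A sketch of the difference, using a backreference to the opening quote (the `tag` string and variable names are made up):

```javascript
// The pattern from this answer: either quote character may close the value.
const looseRx = /href=['"](?<url>.*?)['"]/;
// Variant: group 1 remembers the opening quote and \1 requires the same one to close.
const strictRx = /href=(["'])(?<url>.*?)\1/;

// Hypothetical tag whose double-quoted value contains an apostrophe.
const tag = `<a href="it's.html">x</a>`;

const looseUrl = tag.match(looseRx).groups.url;
const strictUrl = tag.match(strictRx).groups.url;
console.log(looseUrl, strictUrl);
```

With this input the loose pattern stops at the apostrophe, while the backreference variant keeps the whole value.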

3
 HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
 public IHTMLAnchorElement imageElementHref;
 imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;

Simply try this code

Joee
3

I came up with this one, which supports anchor and image tags and both single and double quotes.

<(?:a|img)\\s+(?:[^>]*?\\s+)?(?:src|href)=[\"']([^\"']*)['\"]

So

<a href="/something.ext">click here</a>

Will match:

 Match 1: /something.ext

And

<a href='/something.ext'>click here</a>

Will match:

 Match 1: /something.ext

Same goes for img src attributes
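
A note on brackets versus groups: square brackets define a character class, so something like `[a|img]+` matches any run of the characters a, |, i, m, g, not the tag names a or img. A non-capturing group such as `(?:a|img)` expresses the alternation. A small sketch of the difference, where `stray` is a made-up tag that should not match:

```javascript
// Character classes: matches any mix of the listed letters.
const classRx = /<[a|img]+\s+(?:[^>]*?\s+)?[src|href]+=["']([^"']*)['"]/;
// Alternation groups: matches only the a/img tags and the src/href attributes.
const groupRx = /<(?:a|img)\s+(?:[^>]*?\s+)?(?:src|href)=["']([^"']*)['"]/;

// Hypothetical tag built only from letters in the two character classes.
const stray = '<mag feh="oops">';
const classMatchesStray = classRx.test(stray);
const groupMatchesStray = groupRx.test(stray);

// Both real cases are still captured in group 1.
const anchorValue = '<a href="/something.ext">click here</a>'.match(groupRx)[1];
const imageValue = "<img src='/pic.png'>".match(groupRx)[1];
console.log(classMatchesStray, groupMatchesStray, anchorValue, imageValue);
```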

Base33
0

I think this is one of the simplest preg_match patterns for this case:

/<a\s*(.*?id[^"]*")/g

It gets links that have the variable id in the address.

The capture starts at href (inclusive), takes any characters (. matches everything except newlines) up to the first occurrence of id (inclusive), and then everything up to and including the next " character ([^"]*").

0

(?<=href=")(.*?)(?=")

None of the other answers actually select the VALUE of the href so in my eyes they are all incorrect. See here for a full breakdown that's better than anything I could type here. https://regexr.com/7egrc

Be aware that this does not work in older browsers. It does work in all modern browsers. See the full list here. https://caniuse.com/js-regexp-lookbehind
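
For reference, a minimal sketch of how this behaves in JavaScript. The `pageHtml` string is made up, and the pattern only handles double-quoted values. Because the lookbehind and lookahead consume nothing, the whole match is the href value and no group index is needed.

```javascript
// The pattern from this answer, with the global flag so match() returns every value.
const lookaroundRx = /(?<=href=")(.*?)(?=")/g;

// Hypothetical input with two double-quoted hrefs.
const pageHtml = '<a href="/one.html">1</a> <a href="/two.html?id=2">2</a>';

// With the g flag, match() returns the full matches, which are the href values.
const values = pageHtml.match(lookaroundRx);
console.log(values);
```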

Sean Patnode