How to parse link of web page from string?

Question

    Regex linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    string rawString = link;
    foreach (Match m in linkParser.Matches(rawString))
    {
        string links = m.Value;
    }

I'm trying to get the link out of this string:

<a href="http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1"><b>

I want to get only this part:

http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1

But what I'm getting in the string links is:

http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1"><b

The trailing "><b is left at the end.

In an anchor or from the text? If the former `` or use a proper parser like Html Agility Pack — Alex K., Jul 04 '14 at 11:46
Please, check this [answer](http://stackoverflow.com/a/190405/982431) to another similar question. — HuorSwords, Jul 04 '14 at 11:46
Please do not repost the same question again and again, better edit your [old question](http://stackoverflow.com/questions/24551077/how-can-i-extract-a-text-from-string-variable-using-regex) and get it reopened if really necessary, but as for your other question, the optimal solution is `NOT TO USE REGEX` but a more suitable tool such as [HTML Agility Pack](http://htmlagilitypack.codeplex.com/), just because `HTML is no regular language` — DrCopyPaste, Jul 04 '14 at 12:51
possible duplicate of [How to extract href tag from a string in C#?](http://stackoverflow.com/questions/22151037/how-to-extract-href-tag-from-a-string-in-c) — Abbas, Jul 04 '14 at 12:56

strohkoenig · Answer 1 · 2014-07-04T12:45:13.293

Try changing \S+ to [^\"\>]+

Final string: \b(?:https?:\/\/|www\.)[^\"\>]+\b

Note that this does not only match working links. If your link were something like <a href="www.a<not working Link>flupp"><b>, it would match www.a<not working Link.

This expression simply matches everything up to the next " or >. If the HTML is well-formed and you know that the text between the two quotation marks is a normal link, excluding " alone is enough, which turns the expression into \b(?:https?:\/\/|www\.)[^\"]+\b.

With that quote-only expression it would match www.a<not working Link>flupp, which is exactly what stands between the two quotation marks.

To exclude more characters, add them to the character class [^\"\>]+.

Btw: I think it could make sense to escape both / after ?:https?: (in .NET the / is not a special character, so the escape is optional but harmless).

The reason the original pattern returns too much is that \S+ tells the engine to match all non-whitespace characters, and the final \b requires the match to end at a word boundary. Because the expression is greedy, it "eats" as many non-whitespace characters as possible. " and > are not whitespace characters, whereas a space and a tab are, so the match runs on until the b of <b. [^\"]+ tells the engine to take all characters until it finds a ". After finding one it stops.
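To make the difference concrete, here is a minimal sketch that runs the question's pattern and the adjusted pattern against the sample string. The class name `LinkRegexDemo` and the helper `FindLinks` are made up for illustration and are not part of the original code. The slashes are left unescaped, since .NET does not require escaping them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkRegexDemo
{
    // Pattern from the question: \S+ runs through the closing quote and the tag.
    public const string GreedyPattern = @"\b(?:https?://|www\.)\S+\b";

    // Adjusted pattern: stop at the first double quote or '>'.
    // Inside a verbatim string, "" stands for one literal double quote.
    public const string BoundedPattern = @"\b(?:https?://|www\.)[^"">]+\b";

    // Hypothetical helper: returns every match of the pattern in the input.
    public static List<string> FindLinks(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        string html = "<a href=\"http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1\"><b>";

        // Ends with: scoops1"><b
        Console.WriteLine(FindLinks(GreedyPattern, html)[0]);

        // Ends with: scoops1
        Console.WriteLine(FindLinks(BoundedPattern, html)[0]);
    }
}
```

The bounded pattern still accepts anything that sits between the prefix and the next quote, so it says nothing about whether the link is valid.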

score 0 · Answer 2 · answered Jul 04 '14 at 13:16

I ended up using the HtmlAgilityPack, as some people suggested, since HTML is not a regular language. After you have downloaded it, and assuming that this is the only node in the source that contains this text:

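    // Requires the HtmlAgilityPack package, plus: using System; using System.IO;
    HtmlAgilityPack.HtmlDocument hp = new HtmlAgilityPack.HtmlDocument();
    string source = File.ReadAllText(@"C:\Users\Admin\Desktop\source.txt");
    hp.LoadHtml(source);
    // Select the single <a> element whose href contains the wanted URL.
    var node = hp.DocumentNode.SelectSingleNode("//a[contains(@href, 'http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1')]");
    // SelectSingleNode returns null when nothing matches.
    if (node != null)
    {
        string found = node.Attributes["href"].Value;
        Console.WriteLine(found);
    }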

You can pull your source from anywhere you want, either downloaded via WebClient or read from a local file. This will return: http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1
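If the URL is not known in advance, one possible variation is to select every anchor that carries an href attribute instead of searching for a specific address. This is only a sketch under the assumption that the HtmlAgilityPack package is referenced. The class name `HrefExtractor` and the method `ExtractHrefs` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HrefExtractor
{
    // Hypothetical helper: collects the href value of every <a href="..."> element.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the fallback used if the attribute is missing.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        string html = "<a href=\"http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=112190&forum=scoops1\"><b>";
        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because the parser reads the attribute itself, the closing quote and the following tag never end up in the result.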
