Finding Link Text with Regular Expressions

Question

Team:

I need some help with some regular expressions. The goal is to be able to identify three different ways that users might express links in a note, and those are as follows.

<a href="http://www.msn.com">MSN</a>

possibilities

    http://www.msn.com     OR
    https://www.msn.com    OR
    www.msn.com

Once I can find them, I can convert each one to a real A tag as necessary. I realize the first example is already an A tag, but I need to add some attributes to it that are specific to our application -- such as TARGET and ONCLICK.

Now, I have regular expressions that can find each one of those individually, and those are as follows, respective to the examples above.

<a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?
[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?

But the problem is that I can't run all of them on the same string, because the second one will match part of the first one, and the third one will match part of both the first and the second. At any rate -- I need to be able to find the three variants distinctly so I can replace each one individually -- because the third expression, for example, will need http:// added to it.
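The overlap described above can be reproduced with a few lines of C#. This is only a minimal sketch: the class and method names (`OverlapDemo`, `FindBareHosts`) are made up for illustration, and the pattern is the third expression from the question, copied as posted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Third expression from the question (bare host names), copied as posted.
    private const string BareHostPattern =
        @"[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?";

    // Returns every substring of 'text' that the bare-host pattern matches.
    public static List<string> FindBareHosts(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, BareHostPattern))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Calling `OverlapDemo.FindBareHosts` on the A tag example or on the plain http:// example reports the host inside it as a match, which is why the three expressions cannot simply be run one after another on the same string.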

I look forward to everybody's assistance!

See SO question [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) — Olivier Jacot-Descombes, Feb 25 '12 at 18:03

jCoder · Accepted Answer · 2012-03-03T22:14:50.500

Assuming that each link is delimited by whitespace or by the beginning/end of the line (or sits inside an existing A tag), I came up with the following code, which also includes some sample texts:

// Requires: using System.Text.RegularExpressions; and using System.Windows.Forms; (for MessageBox)
// Group 1: text before the link (start of an existing A tag, start of line, or whitespace)
// Group 2: the link itself
// Group 3: text after the link (rest of an existing A tag, whitespace, or end of line)
// Group 4: the link text of an existing A tag
string regexPattern = "((?:<a (?:.*?)href=\")|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:\\S+?))+?)((?:\"(?:.*?)>(.*?)</a>)|\\s|$)";
string[] examples = new string[] {
    "some text <a href=\"http://www.msn.com/path/file?page=some.page&subpage=9#jump\">MSN</a>  more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text https://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text www.msn.com/path/file?page=some.page&subpage=9#jump",
    "www.msn.com/path/file?page=some.page&subpage=9#jump more text"
};

Regex re = new Regex(regexPattern);
foreach (string s in examples) {
    // Show the captured groups for every link found in the sample text.
    MatchCollection mc = re.Matches(s);
    foreach (Match m in mc) {
        string prePart = m.Groups[1].Value;
        string actualLink = m.Groups[2].Value;
        string postPart = m.Groups[3].Value;
        string linkText = m.Groups[4].Value;
        MessageBox.Show(" prePart: '" + prePart + "'\n actualLink: '" + actualLink + "'\n postPart: '" + postPart + "'\n linkText: '" + linkText + "'");
    }
}

Because this code uses only numbered groups, it should be possible to use the regular expression in JavaScript too.

Depending on what you need to do with the existing A tag, you may also need to parse the first group, which holds the part of the tag before the link.
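As a rough illustration of how the numbered groups could drive the replacement, here is a minimal sketch using `Regex.Replace` with a `MatchEvaluator`. It is not part of the original answer: the names `NoteLinkRewriter`, `Rewrite`, and `ExtraAttributes` are hypothetical, the `target="_blank"` attribute is only a placeholder for whatever the application needs, and the pattern is the first version shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoteLinkRewriter
{
    // Pattern copied from the answer above (first version).
    private static readonly Regex LinkRegex = new Regex(
        "((?:<a (?:.*?)href=\")|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:\\S+?))+?)((?:\"(?:.*?)>(.*?)</a>)|\\s|$)");

    // Hypothetical placeholder; a real application would add its own attributes.
    private const string ExtraAttributes = " target=\"_blank\"";

    public static string Rewrite(string note)
    {
        return LinkRegex.Replace(note, m =>
        {
            string prePart = m.Groups[1].Value;   // "<a ...href=\"", whitespace, or empty
            string url = m.Groups[2].Value;       // the link itself
            string postPart = m.Groups[3].Value;  // rest of the tag, whitespace, or empty

            if (prePart.StartsWith("<a ", StringComparison.Ordinal))
            {
                // Existing tag. If group 3 does not start with the closing
                // quote of the href (malformed tag), leave the text untouched.
                if (!postPart.StartsWith("\"", StringComparison.Ordinal))
                    return m.Value;
                // Re-emit the tag with the extra attributes after the href value.
                return prePart + url + "\"" + ExtraAttributes + postPart.Substring(1);
            }

            // Bare link: add a scheme if it is missing and wrap it in a new tag.
            // HTML encoding of the URL is omitted to keep the sketch short.
            bool hasScheme = url.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                          || url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
            string href = hasScheme ? url : "http://" + url;
            return prePart + "<a href=\"" + href + "\"" + ExtraAttributes + ">" + url + "</a>" + postPart;
        });
    }
}
```

Because the evaluator returns the replacement for each match, all three kinds of link are handled in one pass. One limitation of the pattern carries over: the whitespace after a link is consumed as part of the match, so two bare links separated by a single space may not both be rewritten.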

Update: Modified the regex as requested so that the link text becomes group no. 4.

Update 2: To better catch malformed links you might try this modified version:

pattern = "((?:<a (?:.*?)href=\"?)|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:[^>\"\\s]+))+)((?:\"?(?:.*?)>(.*?)</a>)|\\s|$)";

This seems to be working awesome, but one question I have is can you modify the Regex some so that the text between the tags is in another group? — Mike Perrenoud, Feb 25 '12 at 19:43
I've updated the code to make the link text become group no. 4 — jCoder, Feb 25 '12 at 21:00
I'm using the Regex but it's not capturing the link text when there's an A tag. — Mike Perrenoud, Mar 01 '12 at 00:14
Scratch that, I just needed to change up some options so that it would run the Regex right. Sorry about that. — Mike Perrenoud, Mar 01 '12 at 00:34
I have a question. This Regex, when used with this replace `link = match.replace(PATTERN, "$1$2\" class=\"BC_ANCHOR\" target=\"_blank\" onclick=\"preventDualEditing(event)$3");` results in an empty string if the user enters a malformed link like ` — Mike Perrenoud, Mar 03 '12 at 20:29
The pattern in the original answer heavily relies on the link href ending with "> so I added an additional version which also should work with malformed links, at least to a certain degree. For playing with regex I really recommend http://gskinner.com/RegExr/ (which also has an AIR based desktop version). — jCoder, Mar 03 '12 at 22:16

score 0 · Answer 2 · answered Feb 25 '12 at 18:14

Well, if you want to do it in a single pass, you could create named groups for each scenario:

(?<full><a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>.*</a>)|
(?<url>(http|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)|
(?<www>[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)

Then you would have to check which was the matched group:

Match match = regex.Match(input); // input is the note text to search

if (match.Success)
{
    if (match.Groups["full"].Success) 
       Console.WriteLine(match.Groups["full"].Value);
    else if (match.Groups["url"].Success)
    ....
}
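A single-pass replacement along these lines could look like the sketch below. It is an illustration under stated assumptions, not the answer's own code: the three alternatives are deliberately simplified stand-ins for the expressions above (the `www` branch only recognizes hosts that start with `www.`), and the names `NamedGroupLinkRewriter`, `Rewrite`, `MakeAnchor`, and `ExtraAttributes` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupLinkRewriter
{
    // Simplified alternatives (assumptions, not the patterns from the question).
    // Alternatives are tried left to right at each position, so a complete A tag
    // is consumed as "full" before the URL inside it can match on its own.
    private static readonly Regex LinkRegex = new Regex(
        "(?<full><a\\s[^>]*>.*?</a>)" +
        "|(?<url>https?://[^\\s<>\"]+)" +
        "|(?<www>www\\.[^\\s<>\"]+)",
        RegexOptions.IgnoreCase);

    // Hypothetical placeholder; a real application would add its own attributes.
    private const string ExtraAttributes = " target=\"_blank\"";

    private static string MakeAnchor(string href, string text)
    {
        return "<a" + ExtraAttributes + " href=\"" + href + "\">" + text + "</a>";
    }

    public static string Rewrite(string note)
    {
        return LinkRegex.Replace(note, m =>
        {
            if (m.Groups["full"].Success)
            {
                // Existing tag: insert the attributes right after "<a".
                return m.Value.Insert(2, ExtraAttributes);
            }
            if (m.Groups["url"].Success)
            {
                // Link that already carries a scheme.
                return MakeAnchor(m.Value, m.Value);
            }
            // Only the "www" group is left: prepend the scheme for the href.
            return MakeAnchor("http://" + m.Value, m.Value);
        });
    }
}
```

The evaluator checks which named group succeeded and builds the replacement for that case, so no `Substring()` bookkeeping is needed. Trailing punctuation directly after a link would be captured as part of it, so the character classes may need tightening for real notes.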

Bruno Silva

Two questions -- first, can I issue a REPLACE on the named groups? Second, if so, can that same REPLACE be issued on a REPLACE in JavaScript? – Mike Perrenoud Feb 25 '12 at 18:38
Hmm, you're right, that would probably require some complex solution using `Substring()` and the Index/Length properties of Group. If you're ok with 3 executions, you could try updating the 2nd and 3rd regexes to exclude the previous ones (probably with a look-behind). Something like [this](http://stackoverflow.com/questions/6005609/replace-some-groups-with-regex-but-maintain-the-entire-text). – Bruno Silva Feb 25 '12 at 18:57
