Regex replace domain substring with html tag in C#

Question

I'm trying to replace plain, domain-like substrings of an input string with 'a' tags, using a regex like this:

var pattern = @"[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})";

var input = "text1 www.example.com text2 <a href='foo'>www.example.com</a> text3";

var result = Regex.Replace(input, pattern, "<a href='$0'>$0</a>");

This produces the following output:

text1 <a href='www.example.com'>www.example.com</a> text2 <a href='foo'><a href='www.example.com'>www.example.com</a></a> text3

This is wrong: the second domain is already inside an 'a' tag, so it ends up as a tag within a tag.

Is there a way to modify the regex pattern so that it skips the second domain substring?

Perhaps by rejecting a match that has a '>' character directly before it (or a '<' character directly after it)?

That would effectively generate this result:

text1 <a href='www.example.com'>www.example.com</a> text2 <a href='foo'>www.example.com</a> text3

How about using an HTML parser for the job? HTML doesn't lend itself to being messed around with by regex. HtmlAgilityPack is good. – spender, Jan 23 '14 at 15:26
To be fair, it's not really HTML yet. But http://stackoverflow.com/a/1732454/1336590 is still a must-read. What should happen to something like ` www.example.com ` (note the spaces)? Would it be enough to say that a match must not have a `>` directly before or a `<` directly after it? – Corak, Jan 23 '14 at 15:30
@Corak - I'm aware that a space might occur between the tag's closing character and the start of the domain substring, as you described, but I wanted to simplify the question. – dzolnjan, Jan 23 '14 at 15:40
@Thiago - the domain substrings are dynamic, so it could be example.com, foo.bar.com.au, basically anything that looks like a valid domain name. – dzolnjan, Jan 23 '14 at 15:42
I guess the question comes down to: can a regex pattern be made to match www.example.com but ignore >www.example.com ? – dzolnjan, Jan 23 '14 at 15:44
I'm not all that familiar with html, but shouldn't anchor tags be like `foo`? – Jerry, Jan 23 '14 at 15:59
You can use this pattern: `@"([^<]+|<(?!/a>))+)|[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})"` and a MatchEvaluator function. In the function: IF m.Groups[1] exists THEN return the whole match ELSE return your replacement. – Casimir et Hippolyte, Jan 23 '14 at 16:43
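
A minimal sketch of the match-evaluator idea from the last comment. It is not the commenter's exact pattern: the named group `anchor`, the class name `AnchorSkippingLinker` and the method `Linkify` are assumptions made for illustration. The first alternative consumes a whole existing `<a ...>...</a>` element, so the evaluator can return it unchanged and wrap only bare domains.

```csharp
using System.Text.RegularExpressions;

public static class AnchorSkippingLinker
{
    // Alternative 1 (group "anchor", an assumed name) consumes an entire existing
    // <a ...>...</a> element, so nothing inside it is matched as a bare domain.
    // Alternative 2 is the domain pattern from the question.
    private static readonly Regex Pattern = new Regex(
        @"(?<anchor><a\b[^>]*>.*?</a>)|[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string input)
    {
        return Pattern.Replace(input, m =>
            m.Groups["anchor"].Success
                ? m.Value                                           // existing link: keep as-is
                : "<a href='" + m.Value + "'>" + m.Value + "</a>"); // bare domain: wrap it
    }
}
```

Called with the input from the question, `AnchorSkippingLinker.Linkify(input)` leaves the existing link alone and wraps only the first domain. Nested or malformed anchors are outside what this sketch handles, which is where an HTML parser becomes the safer choice.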

Srb1313711 · Accepted Answer · 2014-01-23T17:04:36.780

2

Try this:

 (?i)(?<!>)((w{3}\.)[^.]+\.[a-z]+(\.?[a-z])*)

This assumes each domain begins with www. You can use it with your existing replace call, and it will work unless the domain is directly preceded by a >. This may not be exactly what you are looking for, but it's somewhere to start. Research negative lookbehinds, as I believe they will help you.
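
As a sketch of how the pattern above plugs into the question's replace call (the class name `WwwLinker` and the method `Linkify` are hypothetical, chosen only for this example):

```csharp
using System.Text.RegularExpressions;

public static class WwwLinker
{
    // Pattern from the answer: (?<!>) is a negative lookbehind that rejects a
    // match whose first character is directly preceded by '>'.
    private const string Pattern = @"(?i)(?<!>)((w{3}\.)[^.]+\.[a-z]+(\.?[a-z])*)";

    public static string Linkify(string input)
    {
        // $0 is the whole match, as in the question's replacement string.
        return Regex.Replace(input, Pattern, "<a href='$0'>$0</a>");
    }
}
```

With the question's input, the first www.example.com is wrapped and the one that follows `>` is left alone. Note that `[^.]+` accepts any character other than a dot, including spaces, so the pattern is looser than the question's character classes.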

edited Jan 23 '14 at 17:04

answered Jan 23 '14 at 15:56

Srb1313711

I guess there's a lost char in your regex `"z"` – Thiago Vinicius Jan 23 '14 at 16:03
How about without www. assumption? (example.com) – dzolnjan Jan 23 '14 at 16:04
Thought that might be an issue, but like I said, it's somewhere to start; any improvements are more than welcome :-) And yes @ThiagoVinicius, that "z" was meant to be a "\" – Srb1313711 Jan 23 '14 at 16:31
@Srb1313711 - you definitely know your regex. If it's not too much to ask for one more hint: what would that regex look like to match 'example.com' but ignore 'example.com'? – dzolnjan Jan 23 '14 at 16:46
Nothing's too much to ask, this is precisely what this site is for :-) I'm not sure I understand your question: if example.com, only match example.com or match nothing? – Srb1313711 Jan 23 '14 at 17:01
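
For the follow-up about dropping the www. assumption, a bare `(?<!>)` in front of the question's pattern is not enough: for `>www.example.com` the engine can skip the first letter and match `ww.example.com`, which is preceded by `w`, not `>`. One possible sketch, assuming it is acceptable to also forbid a letter, digit, dot or hyphen directly before the match (the class name `BareDomainLinker` and the method `Linkify` are hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class BareDomainLinker
{
    // Domain pattern from the question, prefixed with a negative lookbehind.
    // The lookbehind rejects a start position directly after '>' and also after
    // any character that could itself belong to a domain, so the engine cannot
    // start one character later and match "xample.com".
    private const string Pattern =
        @"(?<![>A-Za-z0-9.-])[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})";

    public static string Linkify(string input)
    {
        return Regex.Replace(input, Pattern, "<a href='$0'>$0</a>");
    }
}
```

This covers only the simplified case from the question. A domain inside an attribute value such as href='example.com', or one separated from the `>` by whitespace, would still be matched.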

score 0 · Answer 2 · answered Jan 23 '14 at 16:46

You can also try the following:

var pattern = @"(.*?)\s([\w*]+(\.{1}\w*)+)";

var result = Regex.Replace(input, pattern, "$1 <a href='$2'>$2</a>");

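This also matches domains that do not start with "www". Because the pattern requires whitespace directly before the domain, a domain that immediately follows a '>' is skipped, but so is a domain at the very start of the input.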

Thiago Vinicius