I'm trying to replace plain domain-like substrings of an input string with 'a' tags, using a regex like this:

var pattern = @"[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})";

var input = "text1 www.example.com text2 <a href='foo'>www.example.com</a> text3";

var result = Regex.Replace(input, pattern, string.Format("<a href='$0'>$0</a>"));

This produces the following output:

text1 <a href='www.example.com'>www.example.com</a> text2 <a href='foo'><a href='www.example.com'>www.example.com</a></a> text3

This is wrong: the second domain is already inside a tag, and it now ends up as a tag within a tag.

Is there a way to modify the regex pattern so that it does not match the second domain substring?

Perhaps by refusing to match when a '>' char directly precedes the domain substring (or a '<' char directly follows it)?

That would effectively generate this result:

text1 <a href='www.example.com'>www.example.com</a> text2 <a href='foo'>www.example.com</a> text3
dzolnjan
  • How about using an HTML parser for the job? HTML doesn't lend itself to being messed around with by Regex. HtmlAgilityPack is good. – spender Jan 23 '14 at 15:26
  • To be fair, it's not really HTML yet. But http://stackoverflow.com/a/1732454/1336590 is still a must read. What should happen to something like ` www.example.com ` (note the spaces)? Would it be enough to say that a match must not have a `>` directly before or a `<` directly after it? – Corak Jan 23 '14 at 15:30
  • What's dynamic in your input and what's not? – Thiago Vinicius Jan 23 '14 at 15:35
  • @Corak - I'm aware that a space might occur between the tag's closing char and the start of the domain substring, like you described, but I wanted to simplify the question. – dzolnjan Jan 23 '14 at 15:40
  • @Thiago - domain substrings are dynamic so it could be example.com, foo.bar.com.au, basically anything that looks like a valid domain name. – dzolnjan Jan 23 '14 at 15:42
  • I guess the question comes down to: can a regex pattern be made to match www.example.com but ignore >www.example.com? – dzolnjan Jan 23 '14 at 15:44
  • what about the rest `text1 ... ... text3` is it fixed? – Thiago Vinicius Jan 23 '14 at 15:44
  • I'm not all that familiar with html, but shouldn't anchor tags be like `foo`? – Jerry Jan 23 '14 at 15:59
  • You can use this pattern: `@"([^<]+|<(?!/a>))+)|[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,})"` and a matchevaluator function. In the function: IF m.Groups[1] exists THEN return the whole match ELSE return your replacement. – Casimir et Hippolyte Jan 23 '14 at 16:43
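The match-evaluator idea from the last comment can be sketched roughly as follows. This is only an illustration under assumptions: the first alternative, `<a\b[^>]*>.*?</a>`, is my own guess at a pattern that consumes a whole existing anchor (it is not the commenter's exact pattern), and the names `AnchorAwareLinker` and `Linkify` are hypothetical. Existing anchors are captured in group 1 and returned untouched; anything matched by the second alternative is a bare domain and gets wrapped.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: leaves existing <a>...</a> elements alone
// and wraps every other domain-like substring in an anchor.
public static class AnchorAwareLinker
{
    // Group 1: a complete existing anchor (assumed form, see note above).
    // Second alternative: the domain pattern from the question.
    private static readonly Regex Pattern = new Regex(
        @"(<a\b[^>]*>.*?</a>)|[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string input)
    {
        return Pattern.Replace(input, m =>
            m.Groups[1].Success
                ? m.Value                                            // already a link: keep as is
                : "<a href='" + m.Value + "'>" + m.Value + "</a>");  // bare domain: wrap it
    }
}

public static class AnchorAwareLinkerDemo
{
    public static void Main()
    {
        var input = "text1 www.example.com text2 <a href='foo'>www.example.com</a> text3";
        Console.WriteLine(AnchorAwareLinker.Linkify(input));
        // prints: text1 <a href='www.example.com'>www.example.com</a> text2 <a href='foo'>www.example.com</a> text3
    }
}
```

Because the existing anchor is consumed as a single match, domains inside its attributes or separated from the `>` by whitespace are skipped as well. It is still not a real HTML parser, so nested or malformed markup may defeat it.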

2 Answers

Try this:

 (?i)(?<!>)((w{3}\.)[^.]+\.[a-z]+(\.?[a-z])*)

This assumes each domain begins with www. You can use it with your Replace call, and it will match unless the domain is directly preceded by a >. This may not be exactly what you are looking for, but it's somewhere to start. Research negative lookbehinds, as I believe they will help you.
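A minimal sketch of the same negative-lookbehind idea without the www assumption, reusing the domain pattern from the question. The names `DomainLinker` and `Linkify` are hypothetical. Note that the lookbehind has to reject domain characters as well as `>`; with `(?<!>)` alone the engine would simply start the match one character later, at `ww.example.com`.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating a negative lookbehind.
public static class DomainLinker
{
    // (?<![>A-Za-z0-9.-]) : the match must not be directly preceded by '>'
    //                       or by another domain character.
    private const string Pattern =
        @"(?<![>A-Za-z0-9.-])[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}";

    public static string Linkify(string input)
    {
        // $0 is the whole match.
        return Regex.Replace(input, Pattern, "<a href='$0'>$0</a>");
    }
}

public static class DomainLinkerDemo
{
    public static void Main()
    {
        var input = "text1 www.example.com text2 <a href='foo'>www.example.com</a> text3";
        Console.WriteLine(DomainLinker.Linkify(input));
        // prints: text1 <a href='www.example.com'>www.example.com</a> text2 <a href='foo'>www.example.com</a> text3
    }
}
```

This only covers the simplified case from the question. A domain separated from the `>` by whitespace, or one sitting inside an attribute value such as `href='www.example.com'`, would still be matched.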

Srb1313711
  • I guess there's a lost char in your regex `"z"` – Thiago Vinicius Jan 23 '14 at 16:03
  • How about without www. assumption? (example.com) – dzolnjan Jan 23 '14 at 16:04
  • Thought that might be an issue, but like I said, it's somewhere to start; any improvements more than welcome :-) And yes @ThiagoVinicius, that "z" was meant to be a "\" – Srb1313711 Jan 23 '14 at 16:31
  • @Srb1313711 - you definitely know your regex. If not too much to ask for one more hint: how would that regex look like to match 'example.com' but ignore 'example.com'? – dzolnjan Jan 23 '14 at 16:46
  • Nothing's too much to ask, this is precisely what this site is for :-) I'm not sure I understand your question: if the input is example.com, should it match only example.com, or match nothing? – Srb1313711 Jan 23 '14 at 17:01

You can also try the following:

var pattern = @"(.*?)\s([\w*]+(\.{1}\w*)+)";

var result = Regex.Replace(input, pattern, string.Format("$1 <a href='$2'>$2</a>"), RegexOptions.None);

It would also match domains that do not start with "www".

Thiago Vinicius