regex to exclude if preceded by string?

Question

I haven't used regex much before but found something useful on the net that I'm using:

private string ConvertUrlsToLinks(string msg)
{
    string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";
    Regex r = new Regex(regex, RegexOptions.IgnoreCase);
    return r.Replace(msg, "<a href=\"$1\" title=\"Click to open in a new window or tab\" target=\"_blank\">$1</a>").Replace("href=\"www", "href=\"http://www").Replace(@"\r\n", "<br />").Replace(@"\n", "<br />").Replace(@"\r", "<br />");
}

It does a good job, but now I want it to exclude URLs that already have an "<a href=" in front of them. There's the closing "</a>" to consider too.

Can that be done with regex, or do I have to use a totally different approach, like writing code?

I think you could use negative look ahead to exclude href=. I think it's (?!href=) or something like that. — Uncle Iroh, Jan 18 '15 at 01:24
Hmmm. It looks suspiciously like you're trying to use regex to parse HTML. — spender, Jan 18 '15 at 01:27
@spender -- that post about parsing html with regex is my favorite. — Uncle Iroh, Jan 18 '15 at 01:28
@Uncle Iroh thanks but where to put that exactly, tried with regexr.com no luck yet. — colin lamarre, Jan 18 '15 at 01:38
@spender well it's not making me a sandwich right now... useless! — colin lamarre, Jan 18 '15 at 01:43
@colinlamarre Have you read [the question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?page=1&tab=votes#1732454) spender and Uncle Iroh are talking about? I [recommend](http://meta.stackoverflow.com/questions/267993/why-people-do-not-carefully-read-my-question-even-after-a-spending-a-bounty/268153#268153) rewriting your post without the HTML sample. It is generally a bad idea to parse HTML with regex; HtmlAgilityPack (and maybe the `Uri` class in your case) is generally a better approach if you care about the result rather than about learning regex. — Alexei Levenkov, Jan 18 '15 at 02:01
yeah, because of its global nature I'm not having much luck with negative lookahead, and negative lookbehind isn't allowed in javascript. Sorry mate. — Uncle Iroh, Jan 18 '15 at 02:06
@Alexei Levenkov that post was too funny! I was heading for a melt down according to him, but not really, only dealing with urls here not tons HTML. — colin lamarre, Jan 18 '15 at 03:28
@Uncle Iroh it's good cuz .net regex supports lookbehind, found debuggex.com which supports it too in python mode. Still not obvious tho :) i sense a melt down. — colin lamarre, Jan 18 '15 at 03:34

score 0 · Answer 1 · answered Jan 18 '15 at 03:44

Try this:

((?<!href=')(?<!href=")(www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])

I tested on regex101.com

With the following sample set:

www.google.com
http://hi.com
http://www.fishy.com
href='www.ignore.com'
www.ouch.com
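
For anyone who wants to check this in .NET rather than on a website, here is a minimal sketch. The class and method names (`LookbehindCheck`, `HasBareUrl`, `FirstMatch`) are made up for illustration, and it assumes the `&#95;` in the pattern was meant to be a plain underscore. Note that the double quote has to be doubled inside a C# verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindCheck
{
    // The pattern from this answer as a C# verbatim string.
    // Assumption: "&#95;" was an HTML-encoded underscore, so "_" is used here.
    public const string Pattern =
        @"((?<!href=')(?<!href="")(www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";

    // True when the line contains a URL that the lookbehinds do not reject.
    public static bool HasBareUrl(string line)
    {
        return Regex.IsMatch(line, Pattern, RegexOptions.IgnoreCase);
    }

    // Returns the first matched text, or an empty string when nothing matches.
    public static string FirstMatch(string line)
    {
        Match m = Regex.Match(line, Pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Calling `HasBareUrl` on each line of the sample set should report a match for every line except `href='www.ignore.com'`. One case outside the sample set seems worth checking as well: for `href="http://www.test.com"` the lookbehind only rejects the match that starts at `http`, so the engine appears to retry at `www` and match from there, quote included. That is part of why skipping the whole existing tag may be the safer route.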

Uncle Iroh

the big problem was those websites, they're not working properly it seems, .net does a better job. So how to tell it to skip to the ending "/a" now? – colin lamarre Jan 18 '15 at 04:48

score 0 · Answer 2 · answered Jan 18 '15 at 04:14

Using your existing regex pattern you could make a few simple changes to handle additional text being prepended or appended to your string:

`.+` <- pattern -> `(.+)?`

Which would give you:

.+((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])(.+)?

So passing either of these strings:

<a href='http://www.test.com'>http://www.test.com</a>

...or...

http://www.test.com

Would result in:

<a href="http://www.test.com" title="Click to open in a new window or tab" target="_blank">www.test.com</a>

Examples:

https://regex101.com/r/bO0cW6/1

http://ideone.com/suVw3I

l'L'l

thanks but the urls are buried in the middle of up to 65k data, so how to tell it between "a href" and "/a"? – colin lamarre Jan 18 '15 at 04:46
@colinlamarre, I'm not sure that I follow; if you provide an example string of what you mean maybe that would help. – l'L'l Jan 18 '15 at 04:50
it's just say paragraphs of text with some urls in the middle, some have href already, some dont, but the goal is make them all have hrefs. – colin lamarre Jan 18 '15 at 05:01

colin lamarre · Answer 3 · 2015-01-19T01:13:30.533

I think it would be a little ToNy tHe pOny to do that in regex after all, so I wrote the code. In case anyone is interested, here it is:

private string handleatag(string msg, string tagbegin, string tagend)
{
    ArrayList tags = new ArrayList();
    int tagbeginpos = msg.IndexOf(tagbegin);
    int tagendpos;

    string hash = tagbegin.GetHashCode().ToString();

    while (tagbeginpos != -1)
    {
        tagendpos = msg.IndexOf(tagend, tagbeginpos);

        if (tagendpos != -1)
        {
            string atag = msg.Substring(tagbeginpos, tagendpos - tagbeginpos + tagend.Length);
            msg = msg.Replace(atag, hash + tags.Count.ToString());
            tags.Add(atag);
        }
        else
            msg = msg.Remove(tagbeginpos, tagbegin.Length);

        tagbeginpos = msg.IndexOf(tagbegin, tagbeginpos);
    }

    msg = ConvertUrlsToLinks(msg);

    // Restore in reverse order so that placeholder 1 cannot clobber the prefix of 10, 11, ...
    for (int i = tags.Count - 1; i >= 0; i--)
        msg = msg.Replace(hash + i.ToString(), tags[i].ToString());

    return msg;
}

private string ConvertUrlsToLinks(string msg)
{
    if (msg.IndexOf("<a href=") != -1)
        return handleatag(msg, "<a href=", "</a>");

    string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\\||\# |!|\(|?|\[|,| |>|<|;|\)])";
    Regex r = new Regex(regex, RegexOptions.IgnoreCase);
    return r.Replace(msg, "<a href=\"$1\" title=\"Click to open in a new window or tab\" target=\"_blank\">$1</a>").Replace("href=\"www", "href=\"http://www").Replace(@"\r\n", "<br />").Replace(@"\n", "<br />").Replace(@"\r", "<br />");
}
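
For comparison, the same "leave existing links alone" idea can probably be done in one regex pass, without placeholders: let the first alternative swallow a whole existing `<a ...>...</a>`, let the second alternative capture a bare URL, and only rewrite when the URL group matched. This is just a sketch under assumptions: the `LinkifySketch` class, the `Linkify` method and the `url` group name are invented here, the URL pattern is a simplified stand-in for the one above (it must end in a letter, digit or slash), and the line-break replacements are left out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkifySketch
{
    // Alternative 1: a complete existing anchor, consumed so that nothing inside it is touched.
    // Alternative 2: a bare URL, captured in the named group "url" (simplified pattern, an assumption).
    private static readonly Regex Pattern = new Regex(
        @"<a\b[^>]*>.*?</a>|(?<url>(?:www\.|(?:https?|ftp|news|file)://)[_.a-z0-9-]+\.[a-z0-9/_:@=.+?,#%&~-]*[a-z0-9/])",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string msg)
    {
        return Pattern.Replace(msg, m =>
        {
            // The "url" group is empty when alternative 1 matched: return the anchor unchanged.
            if (!m.Groups["url"].Success)
                return m.Value;

            string url = m.Groups["url"].Value;
            // Same idea as the href="www fix-up in the original code.
            string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url
                : url;

            return "<a href=\"" + href + "\" title=\"Click to open in a new window or tab\" target=\"_blank\">" + url + "</a>";
        });
    }
}
```

With this, `Linkify("see <a href='http://www.test.com'>http://www.test.com</a> and www.google.com now")` should keep the first link as it is and wrap only `www.google.com`. An `<a href=` that never gets its closing `</a>` is not consumed by the first alternative, so the URL inside it would still be wrapped; the placeholder code above deals with that case differently, by removing the dangling tag start.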
