
I'm a relative newb when it comes to regexes, but I'm starting to get the hang of it. I started writing a method in Java to "linkify" a string - that is, scan it for any URLs (e.g., "http://...") or strings that look like web addresses ("www.example.com...") and wrap them in anchor tags.

So, for example, if I had a string that looked like this:

My favorite site is http://www.example.com.  What is yours?

After running it through the method, you'd get a string back that said:

My favorite site is <a href="http://www.example.com">http://www.example.com</a>.  What is yours?

After scouring the web for a while, I was finally able to piece together parts of different expressions that do what I'm looking for. (Some of the examples I found include the trailing period at the end of a URL in the matched URL, some re-encode URLs that are already inside anchor tags, etc.)

Here is what I have so far:

public static String toLinkifiedString(String s, IAnchorBuilder anchorBuilder)
{
    if (IsNullOrEmpty(s))
    {
        return Empty;
    }

    String r = "(?<![=\"\"\\/>])(www\\.|(http|https|ftp|news|file)(s)?://)([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?([^.|'|# |!])";

    Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(s);
    if (anchorBuilder != null)
    {
        return matcher.replaceAll(anchorBuilder.createAnchorFromUrl("$0"));
    }
    return matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression
}

public interface IAnchorBuilder
{
    public String createAnchorFromUrl(String url);
}

There is also a simple version of toLinkifiedString which only takes the string s - it just calls toLinkifiedString(s, null).

So like I said, this pattern is catching everything I need it to catch, and the replaceAll is working great for every case, except for when a link begins with www. If the match begins with "www" instead of a protocol, like "http" or "ftp", I want to conditionally prepend "http://" in front of the resultant link. That is:

MyClass.toLinkifiedString("go to www.example.org") 

should return

go to <a href="http://www.example.org">www.example.org</a>

The matching groups are as follows:

  • $0 - the actual url that gets found: http://www.example.org or www.example.net
  • $1 - the protocol match ("http://" or "www" for links w/o protocols)

I suppose what I want to be able to do, in pseudocode is something like:

matcher.replaceAll("<a href=\"" + (if protocol == "www." then "http://" + url, otherwise url) + "\">url</a>")

Is this possible? Or should I just be happy with being able to only create anchors from links that begin with "http://..." :)

Thanks for any help anyone can offer

Alan Moore
matt
  • You don't need to use _quite_ so many backslashes. :D – Alan Moore Jun 10 '09 at 15:20
  • @mjd79: Your regex is quite a mess. Even if you are starting to get the hang of it, you should not copy examples off the Internets without fully understanding what they mean. I can see many wrong assumptions in it (about correct character escaping and about the mechanics of character classes). The question of how to find a URL in a text has been asked here many times; I suggest you look through SO by means of Google. At least the regexes here usually come with a proven explanation. :) – Tomalak Jun 10 '09 at 18:00

2 Answers


For your specific problem, definitely go with a callback function as Tomalak says.

For the problem of all those slashes, and the assorted other oddities...

Here is your current Java regex split across lines:

(?<![=\"\"\\/>])
(www\\.|(http|https|ftp|news|file)(s)?://)
([\\w+?\\.\\w+])+
([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?
([^.|'|# |!])

And the same thing as a non-Java regex (no Java string escapes):

(?<![=""\/>])
(www\.|(http|https|ftp|news|file)(s)?://)
([\w+?\.\w+])+
([a-zA-Z0-9\~\!\@\#\$\%\^\&amp;\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?
([^.|'|# |!])


And here's a description of what's wrong with it... :)

Line one - you're duplicating " in the character class, and don't need to escape /

Line two - ok, except I'm not sure what you're after with the (s)? part, since you have https within the previous group anyway.

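Line three - you are aware that you've got a character class there? Inside [...] the + and ? are literal characters, not quantifiers, so the class just matches a single word character, +, ? or . each time round. You probably want (\w+?\.\w+)+ instead. (That's (\\w+?\\.\\w+)+ in a Java string.)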
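To make the difference concrete, here is a small sketch (the class and constant names are made up for illustration) comparing the original bracketed version with the grouped one. It only shows how the two patterns behave on a few sample strings; it is not a claim about what a good host-name pattern looks like.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // Line three as written in the question: a character class inside a repeated group.
    // '+' and '?' are literals here, so any run of word chars, '+', '?' and '.' is accepted.
    static final Pattern ORIGINAL = Pattern.compile("([\\w+?\\.\\w+])+");

    // The suggested fix: a real group, where '+' and '+?' act as quantifiers again.
    static final Pattern AS_GROUP = Pattern.compile("(\\w+?\\.\\w+)+");

    public static void main(String[] args) {
        System.out.println(ORIGINAL.matcher("example.org").matches()); // true
        System.out.println(ORIGINAL.matcher("?+..").matches());        // true, not a host name at all
        System.out.println(AS_GROUP.matcher("example.org").matches()); // true
        System.out.println(AS_GROUP.matcher("?+..").matches());        // false
    }
}
```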

Line four - wow, what a lot of escaping!! Almost all of it is unnecessary. Give this a go: ([a-zA-Z0-9~!@#$%^&*()_\-=+\/?.:;',]*)? (or, as a Java string: ([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)? )

Line five - alternation doesn't do anything inside a character class; each | is just a literal pipe. This'll do: [^.'# !] , and add a single | if you actually want to prevent the pipe char from being there.
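A quick way to check the point about pipes, as a minimal sketch with hypothetical names: the two classes differ only in whether a literal | is excluded.

```java
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // As written in the question: the '|' characters are literals, so a pipe is excluded too.
    static final Pattern WITH_PIPES = Pattern.compile("[^.|'|# |!]");

    // Same class without the pipes.
    static final Pattern WITHOUT_PIPES = Pattern.compile("[^.'# !]");

    public static void main(String[] args) {
        System.out.println(WITH_PIPES.matcher("|").matches());    // false
        System.out.println(WITHOUT_PIPES.matcher("|").matches()); // true
        System.out.println(WITH_PIPES.matcher("!").matches());    // false
        System.out.println(WITHOUT_PIPES.matcher("!").matches()); // false
        System.out.println(WITHOUT_PIPES.matcher("m").matches()); // true
    }
}
```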

Putting all those comments together provides this regex:

(?<![="/>])
(www\.|(http|https|ftp|news|file)://)
(\w+?\.\w+)+
([a-zA-Z0-9~!@#$%^&*()_\-=+\/?.:;',]*)?
([^.'# !])

Or, yet again, with escaping for Java:

(?<![=\"/>])
(www\\.|(http|https|ftp|news|file)://)
(\\w+?\\.\\w+)+
([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)?
([^.'# !])

Note how much simpler that is!

Going back on a single line for that gives:

(?<![="/>])(www\.|(http|https|ftp|news|file)://)(\w+?\.\w+)+([a-zA-Z0-9~!@#$%^&*()_\-=+\/?.:;',]*)?([^.'# !])

or

(?<![=\"/>])(www\\.|(http|https|ftp|news|file)://)(\\w+?\\.\\w+)+([a-zA-Z0-9~!@#$%^&*()_\\-=+\\/?.:;',]*)?([^.'# !])

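But I'd stick to the multiline one - just plonk (?x) at the very start and it is a valid regex that ignores the whitespace, and you can use #s for commenting - always a good thing with regexes as long as this! One caveat for Java: in (?x) mode it skips whitespace and # inside character classes as well, so the literal # and the literal space in the last two classes need a backslash in front of them.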
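Here is a sketch of how the multiline version could look in Java with (?x). The class name, the helper firstUrl and the comments are my own additions. The literal # and space are escaped because, as far as I know, Java's comments mode would otherwise drop them even inside a character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentedUrlPattern {
    // Each Java string below is one line of the regex; "\n" ends the trailing comment.
    static final Pattern URL = Pattern.compile(
          "(?x)\n"
        + "(?<![=\"/>])                                  # not already inside an attribute or tag\n"
        + "(www\\.|(http|https|ftp|news|file)://)        # group 1: www. or a scheme\n"
        + "(\\w+?\\.\\w+)+                               # dotted host name\n"
        + "([a-zA-Z0-9~!@\\#$%^&*()_\\-=+\\/?.:;',]*)?   # rest of the URL, with the hash escaped\n"
        + "([^.'\\#\\ !])                                # final character, hash and space escaped\n",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first URL-like match, or null if there is none.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstUrl("go to www.example.org"));
        System.out.println(firstUrl("My favorite site is http://www.example.com.  What is yours?"));
    }
}
```

Passing Pattern.COMMENTS as a flag instead of writing (?x) should behave the same way; which one reads better is a matter of taste.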

Peter Boughton
  • Though probably I would have left off the escaping of the backslashes and quotes, since this is a Java String requirement, not a regex requirement. Much of the uncertainty comes from the fact that people constantly keep confusing what escaping is required by what system - the experienced because they know, the unexperienced because they don't, ironically. – Tomalak Jun 12 '09 at 08:54
  • Hmmm, good point. I've gone and added examples without escaping to the answer. Hopefully I've not made it too confusing having both though... maybe I should completely remove the Java ones and just have a quick line or two about escaping? – Peter Boughton Jun 12 '09 at 22:27
  • Thanks for taking the time to *thoroughly* explain :) The reason for the escaping is actually more Intellij than me - it actually automatically escapes strings when you paste them in, a behavior that can grow quite annoying in some cases. – matt Jun 18 '09 at 03:49

Looks like you are in need of a callback function that returns a dynamic result you can use instead of the fixed string you currently have in replaceAll().

I guess you can make something out of the accepted answer to this question: Java equivalent to PHP's preg_replace_callback.
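In plain Java, one way to get callback-like behaviour is the find() / appendReplacement() / appendTail() loop on Matcher. The sketch below is only an illustration: Linkifier and linkify are made-up names, and the pattern is a simplified stand-in, not the one from the question. The part that matters is the loop, where the replacement is built per match and "http://" is prepended only when group 1 is "www.".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Linkifier {
    // Simplified stand-in pattern (an assumption for this example).
    // Group 1 is either "www." or a scheme followed by "://".
    private static final Pattern URL = Pattern.compile(
            "(?<![=\"/>])(www\\.|(?:https?|ftp|news|file)://)[\\w\\-./?=&%#~+:;,]*[\\w/]",
            Pattern.CASE_INSENSITIVE);

    static String linkify(String s) {
        if (s == null || s.isEmpty()) {
            return "";
        }
        Matcher m = URL.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group();
            // Only bare "www." matches need a scheme added to the href.
            String href = m.group(1).equalsIgnoreCase("www.") ? "http://" + url : url;
            String anchor = "<a href=\"" + href + "\">" + url + "</a>";
            // quoteReplacement keeps any '$' or '\' in the URL from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: go to <a href="http://www.example.org">www.example.org</a>
        System.out.println(linkify("go to www.example.org"));
        // prints: My favorite site is <a href="http://www.example.com">http://www.example.com</a>.  What is yours?
        System.out.println(linkify("My favorite site is http://www.example.com.  What is yours?"));
    }
}
```

The same loop would work with the question's IAnchorBuilder by calling createAnchorFromUrl(href) inside it instead of building the anchor string directly. On Java 9 and later, Matcher.replaceAll also accepts a Function<MatchResult, String>, which expresses the same idea without the explicit loop.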

Community
Tomalak