We are using the following regex to recognize URLs (derived from this gist by John Gruber). It is executed in Scala using scala.util.matching, which in turn uses java.util.regex:
(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!@)))
This version has escaped forward slashes, for Rubular:
(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))
Previously the front end sent only plaintext to the back end, but it now lets users create anchor tags for URLs. The back end therefore needs to recognize URLs except those that are already inside anchor tags. I initially tried to accomplish this with a negative lookbehind, ignoring URLs with an href=" prefix:
(?i)\b((?<!href=")((?:https?: ... etc
The problem is that our URL regex is very liberal, recognizing http://www.google.com, www.google.com, and google.com. Given
<a href="http://www.google.com">Google</a>
the negative lookbehind will ignore http://www.google.com, but then the regex will still recognize www.google.com. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com".
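To make the failure concrete, here is a minimal sketch that reproduces it. The pattern is a deliberately simplified stand-in for the full URL regex (an assumption for illustration only), and the names LookbehindDemo and matches are hypothetical. The lookbehind only rejects the one start position directly after href=", so the engine moves on and finds a match starting at www.

```scala
object LookbehindDemo {
  // Simplified stand-in for the real URL regex (assumption, not the production pattern).
  // The negative lookbehind only blocks a match that starts right after href="
  private val simplified =
    """(?i)(?<!href=")\b(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b""".r

  // Returns every substring the simplified pattern recognizes as a URL.
  def matches(text: String): List[String] =
    simplified.findAllIn(text).toList
}
```

Called on <a href="http://www.google.com">Google</a>, this should return only www.google.com: the match starting at http is rejected, but the scheme is optional, so the later start position still succeeds.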
At present I'm using a filter on the URL regex matches (code is in Scala). It also ignores URLs in link text (<a href="http://www.google.com">www.google.com</a>) by ignoring URLs with a > prefix and a </a> suffix. I'd rather stick with the filter if doing this in a regex would make an already complicated regex even more unreadable.
urlPattern.findAllMatchIn(text).toList.filter { m =>
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // URL is an href value: the six characters before it are href="
  val isHref: Boolean = start >= 6 &&
    text.substring(start - 6, start) == """href=""""
  // URL is the whole link text: preceded by > and followed by </a> (four characters)
  val isAnchor: Boolean = start >= 1 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>"
  Option(m.group(1)).isDefined && !(isHref || isAnchor)
}
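One possible regex-only alternative, sketched under assumptions, is to let an earlier alternative consume whole anchor elements and capture URLs only in the second alternative. Matches where group 1 is null are then anchors to skip. The URL part below is a simplified stand-in for the full pattern, the names SkipAnchors and urlsOutsideAnchors are hypothetical, and the anchor alternative is a rough approximation that does not handle nested or malformed HTML.

```scala
object SkipAnchors {
  // Simplified stand-in for the full URL pattern (assumption, not the original regex).
  private val url = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

  // First alternative swallows an entire <a ...>...</a> element without capturing.
  // Second alternative captures a URL in group 1.
  private val pattern = ("""(?is)<a\b[^>]*>.*?</a>|\b(""" + url + ")").r

  // Keeps only matches where group 1 participated, i.e. URLs outside anchor elements.
  def urlsOutsideAnchors(text: String): List[String] =
    pattern.findAllMatchIn(text).flatMap(m => Option(m.group(1))).toList
}
```

Because the anchor element is consumed as a single match, neither the href value nor the link text is visited again, so www.google.com and google.com inside it are never candidates. Whether this is more readable than the filter is a judgment call; it keeps the URL pattern itself unchanged and only wraps it.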