1

I'm trying to use a regex to detect URLs in all the rows of my table, here's the regex

\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/)))

However, I invariably get the "repetition-operator operand invalid" error, which, after hours of search on the internet, still remains obscure. Where have I gone wrong? What can I do to fix this? And alternaltively, is there a better way to detect URLs in messages in SQL other than a Regex?

Thank you.

Unforgiven
  • 321
  • 2
  • 16
  • What about [Regexp to validate URL in MySQL](http://stackoverflow.com/a/9033154/3832970)? Does it answer your question? – Wiktor Stribiżew May 25 '15 at 10:49
  • @stribizhev I might need a more complex regex, otherwise not all URLs would be detected. The one above worked for me perfectly, and if I can help it, I would like to keep working with it. – Unforgiven May 25 '15 at 10:53
  • @stribizhev Getting the Repetition error for this one. As for the regex you suggested, I tried it, and it only detects a small portions of my URLs. – Unforgiven May 25 '15 at 11:00
  • Looks like the problem is with using `?` quantifier, `*?` and `+?` cannot be used in MySQL regexp. Also, not sure about non-capturing groups, replace them with capturing ones. Please re-try with `[[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/)))`. – Wiktor Stribiżew May 25 '15 at 11:07
  • @stribizhev It works this. But the results improved just slightly. Are you sure that the regex hasn't changed at all? (This new syntax boggles me) – Unforgiven May 25 '15 at 11:13

1 Answers1

1

You cannot use ? quantifier in MySQL regex as the syntax is POSIX-based. Still, you can use * to match 0 or more characters. Also, \b in MySQL regex should be replaced with [[:<:]] (since this matches at the beginning of a word).

Thus, I suggest using

[[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/)))

I am expanding \w to [a-zA-Z0-9_] as it is exactly what \w is. Instead of \s, I am using a literal space. Instead of \d, I am using [0-9]. This is done for readability and better compatibility. If \w, \d and \s work for you, you can use them, but I do not see them among the supported entities in POSIX specs.

Also, instead of literal space, you could use [:space:], it matches space, tab, newline, and carriage return. Instead of [a-zA-Z] you can use [:alpha:], and instead of [0-9], you can use [:digit:]. Please also check this:

[[:<:]](([[:alpha:][:digit:]-]+:\/\/*|www[.])[^[:space:]()<>]+(\([[:alpha:][:digit:]_]+\)|([^[:space:][:punct:]]|\/)))
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • It's working flawlessly this time. I'll make sure to read thoroughly through your answer so as to not get stuck twice. Thank you very much. And good day to you. – Unforgiven May 25 '15 at 11:33
  • You are welcome. BTW, I have just found an URL matching regex [here](http://stackoverflow.com/a/3508509/3832970) and tried to adapt it for you: `[[:<:]](https*://)*[a-zA-Z0-9_]+\.(\.*[a-zA-Z0-9_]+)+`. No idea if it works better. – Wiktor Stribiżew May 25 '15 at 11:34