Why is this line of regex capturing white spaces?

Question

I'm using the following line of regex, which I found in this SO answer:

(?:[\w[a-z]-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.??][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?«»“”‘’])

I am testing it on the following string:

"Quattro Amici in Concert Mar. 3, 2014. Long-time collaborators Lun Jiang, violin; Roberta Zalkind, viola; Pegsoon Whang, cello; and Karlyn Bond, piano, will perform works by Franz Joseph Haydn, Wolfgang Amadeus Mozart, Ludwig van Beethoven and Gabriel Faure. To purchase tickets visit westminstercollege.edu/culturalevents or call 801-832-2457. - See more at: http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"

I'm simply trying to extract URLs from strings, and based on a bunch of SO answers, regex seems to be the recommended tool for the job. I'm not a regex expert (or even intermediate in my understanding), so I'm baffled by the empty strings my re.findall() keeps returning. I've stepped through the regex using RegexBuddy and still no luck. Any help would be hugely appreciated.

Regular expressions that are longer than 40-80 characters are [garbage expressions](http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/) (according to me and probably a few others). — Ярослав Рахматуллин, Mar 04 '14 at 03:27

score 1 · Accepted Answer · answered Mar 04 '14 at 03:01

I'm not sure that a big regex like that is entirely necessary - if you're just looking to get links, you could use a much simpler regex, like this:

/(https?:\/\/[\w\d\$-_\.\+!\*'\(\),\/#]+)/ig

According to RFC 1738, URLs are only allowed to use the characters specified in the class above, so it should cover any valid URL without such a gigantic mess of a regex.
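Since the question uses Python's re.findall(), here is a minimal sketch of how the simpler pattern above might be written in Python. The variable names (`URL_RE`, `text`, `urls`) are illustrative assumptions, and the redundant escapes are dropped; the character class is otherwise the same. The JavaScript-style `/.../ig` wrapper becomes `re.IGNORECASE` plus `findall`, and the outer capture group is left out because it is not needed here.

```python
import re

# Python translation of the pattern above (names are illustrative).
# Note: inside the character class, "$-_" is read as a range from "$"
# to "_", so characters such as "?", "=", ":", "&" and "%" match too.
URL_RE = re.compile(r"https?://[\w$-_.+!*'(),/#]+", re.IGNORECASE)

# Tail end of the test string from the question.
text = (
    "To purchase tickets visit westminstercollege.edu/culturalevents "
    "or call 801-832-2457. - See more at: "
    "http://entertainment.sltrib.com/events/view/"
    "quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
)

# No capture groups in the pattern, so findall returns whole matches.
urls = URL_RE.findall(text)
print(urls)
```

Because the pattern requires an explicit `http://` or `https://` prefix, the bare `westminstercollege.edu/culturalevents` in the text is not picked up; only the full sltrib.com link is returned.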

You can also use a tool like regexpal.com to validate regexes, which helps find issues. That said, I pasted your regex in there and it crashed Chrome, so it may not be a great help for a beast like that :)
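When an online tester chokes on a pattern, a quick local check in Python may be enough. The sketch below uses a small hypothetical pattern (not the one from the question) to show one likely source of the empty strings: when a pattern contains capture groups, re.findall() returns the groups rather than the whole match, and a group that took no part in the match comes back as `''`. Using re.finditer() with `group(0)` gives the full matched text instead.

```python
import re

text = "See more at: http://entertainment.sltrib.com/events/view/quattro_amici_in_concert"

# Hypothetical demo pattern with three capture groups; the optional
# "www." group does not take part in the match for this input.
pattern = re.compile(r"(https?)://(www\.)?([\w.-]+)")

# findall returns one tuple of groups per match; the unused group is ''.
groups = pattern.findall(text)

# finditer yields match objects, so group(0) is the whole matched text.
full_matches = [m.group(0) for m in pattern.finditer(text)]

print(groups)
print(full_matches)
```

If the groups themselves are not needed, turning them into non-capturing groups with `(?:...)` makes findall() return the full matches directly.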

answered Mar 04 '14 at 03:01

Jesse

If you're interested in the source of the big guy in your post, here's the original blog post where it was introduced: http://daringfireball.net/2010/07/improved_regex_for_matching_urls - unless your data set is very large and unpredictable, however, it's overkill IMO. Even the author has made more specific regexes; the one you'd be looking for in your case is this one: https://gist.github.com/gruber/8891611 – Jesse Mar 04 '14 at 03:04
