Regex separate urls in text that has no separators

Question

Apologies for yet another regex question!

I have some input text which, rather unhelpfully, has multiple URLs (and only URLs) all on one line with no separators:

https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains just two URLs, but there could be more.

I'm trying to separate the URLs into a list using Python.

I've searched for solutions and tried a few (for example https://stackoverflow.com/a/6883094/659346), but I can't get them to work here, as they greedily consume all the following URLs.

I realise that's probably because https://... could legally appear in the query part of a URL, but in my case I'm willing to assume it can't, and to treat any occurrence as the start of the next URL.

I also tried (http[s]://.*?), but that matches either the whole text (without the ?) or just the https:// (with the ?).

score 3 · Accepted Answer · answered Jan 15 '15 at 15:26

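You need to use a positive lookahead assertion. The lazy .*? then stops as soon as the next https?://, a whitespace character, or the end of the string lies ahead.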

>>> import re
>>> s = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
>>> re.findall(r'https?://.*?(?=https?://|$|\s)', s)
['https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZg', 'https://console.developers.google.com/project/reducted/?authuser=1']
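For comparison, here is a minimal sketch of why the quantifier alone is not enough and what the lookahead changes. The helper name `split_urls` and the `example` hosts are made up for illustration; only the pattern comes from the answer above.

```python
import re

# Pattern from the answer above; the helper name below is hypothetical.
URL_RE = re.compile(r'https?://.*?(?=https?://|$|\s)')

def split_urls(text):
    """Return each URL, treating every http:// or https:// as the start of a new one."""
    return URL_RE.findall(text)

# Made-up input: three URLs run together on one line.
sample = "https://a.example/x?q=1https://b.example/y/http://c.example/z\n"

# Lazy and unanchored: .*? matches as little as possible, so only the scheme is returned.
lazy_only = re.findall(r'https?://.*?', sample)

# Greedy: .* runs to the end of the line and swallows the following URLs.
greedy = re.findall(r'https?://.*', sample)

# Lazy plus lookahead: each match stops right before the next scheme, whitespace, or the end.
urls = split_urls(sample)

print(lazy_only)
print(greedy)
print(urls)
```

This relies on the assumption stated in the question, namely that `http://` or `https://` never appears inside a URL.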

vks · Answer 2 · 2015-01-15T15:30:38.813

0

(https?:\/\/(?:(?!https?:\/\/).)*)

Try this. See the demo:

https://regex101.com/r/tX2bH4/15

import re
p = re.compile(r'(https?:\/\/(?:(?!https?:\/\/).)*)')
test_str = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"

# findall returns the contents of the single capturing group, i.e. one string per URL
print(p.findall(test_str))
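As a rough check of how this pattern behaves when a URL merely contains the text "http", here is a small sketch. The input string is invented for illustration, and the pattern is the one above with the redundant backslashes and the outer group dropped.

```python
import re

# Consume one character at a time, but only while the next characters
# are not the start of another http:// or https:// URL.
pattern = re.compile(r'https?://(?:(?!https?://).)*')

# Made-up input: the first URL has "http" in its path, but not "http://".
text = "https://a.example/http-docs?x=1http://b.example/\n"

result = pattern.findall(text)
print(result)
# ['https://a.example/http-docs?x=1', 'http://b.example/']
```

The trailing newline is left out of the second URL because `.` does not match a newline by default.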

edited Jan 15 '15 at 15:30

answered Jan 15 '15 at 15:22


1

This won't work if the url string contains "http" in the middle. – mbomb007 Jan 15 '15 at 15:24
Yeah, I'd rather have the lookahead test for `http[s]?://` to make it a little more robust. Can't seem to work out how to add the `://` to your answer though :S – GP89 Jan 15 '15 at 15:26
In the look ahead I mean – GP89 Jan 15 '15 at 15:27
Not in the later part – mbomb007 Jan 15 '15 at 15:27
`findall()` should be used to return matches, but your regex only captures each in a capturing group instead. Matching them with non-capturing groups would be faster than returning the groups and then converting that to a list, since obtaining groups still requires the use of at least a call of `match()`. See here: https://docs.python.org/2/library/re.html – mbomb007 Jan 15 '15 at 22:57
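On the capturing-group point, a quick sketch of what `re.findall` returns in each case may help. The input string is made up. When the pattern has exactly one group, `findall` returns the text of that group, and when it has none, it returns the whole match, so both forms give a plain list of strings here.

```python
import re

# Made-up input: two URLs run together.
text = "http://a.example/http://b.example/"

# One capturing group around the whole pattern: findall returns that group's text.
with_group = re.findall(r'(https?://(?:(?!https?://).)*)', text)

# No capturing group: findall returns the full match.
without_group = re.findall(r'https?://(?:(?!https?://).)*', text)

print(with_group)
print(without_group)
```

Any speed difference between the two forms is not measured here.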
