Apologies for yet another regex question!

I have some input text that, rather unhelpfully, contains multiple URLs (and only URLs) all on one line with no separators:

https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains just two URLs, but there could be more.

I'm trying to separate the URLs into a list using Python.

I've searched for solutions and tried a few (e.g. https://stackoverflow.com/a/6883094/659346), but I can't get any of them to work exactly, because they greedily consume all the following URLs.

I realise that's probably because `https://...` could legally appear in the query part of a URL, but in my case I'm willing to assume it can't, and to treat any occurrence as the start of the next URL.

I also tried `(http[s]://.*?)`, but without the `?` it matches the whole text, and with it just the `https://`.
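
A minimal sketch of that behaviour, using a shortened made-up input (the `a.example` and `b.example` hosts are placeholders, not part of the real data):

```python
import re

# Shortened stand-in for the real input: two URLs run together on one line.
text = "https://a.example/x?q=1https://b.example/y\n"

# Greedy: .* runs to the end of the line, swallowing the second URL.
greedy = re.findall(r'(http[s]://.*)', text)

# Lazy: .*? with nothing after it is satisfied by zero characters,
# so each match stops right after the scheme.
lazy = re.findall(r'(http[s]://.*?)', text)

print(greedy)  # ['https://a.example/x?q=1https://b.example/y']
print(lazy)    # ['https://', 'https://']
```

Neither form knows where one URL ends, so the pattern needs something after the wildcard that marks the boundary.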

GP89

2 Answers

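You need to use a positive lookahead assertion, so that the lazy `.*?` stops just before the next `http://` or `https://`, a whitespace character, or the end of the string.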

>>> import re
>>> s = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
>>> re.findall(r'https?://.*?(?=https?://|$|\s)', s)
['https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZg', 'https://console.developers.google.com/project/reducted/?authuser=1']
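
As a rough sketch of how this could be reused for lines with more than two URLs, the same pattern can be compiled once and wrapped in a small helper. The names `URL_RE` and `split_urls` and the `example` hosts are made up for illustration:

```python
import re

# Same pattern as above: lazily consume characters until the lookahead sees
# the start of the next URL, the end of the string, or whitespace.
URL_RE = re.compile(r'https?://.*?(?=https?://|$|\s)')

def split_urls(line):
    """Return each URL found in a run of concatenated URLs."""
    return URL_RE.findall(line)

urls = split_urls("http://a.example/1https://b.example/2?x=yhttp://c.example/\n")
print(urls)
# ['http://a.example/1', 'https://b.example/2?x=y', 'http://c.example/']
```

Because the lookahead does not consume anything, the next match can start exactly where the previous one stopped.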
Avinash Raj
(https?:\/\/(?:(?!https?:\/\/).)*)

Try this. See the demo:

https://regex101.com/r/tX2bH4/15

import re

# Match a URL start, then any character that does not begin another URL.
p = re.compile(r'(https?:\/\/(?:(?!https?:\/\/).)*)')
test_str = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"

# With a single capturing group, findall returns that group for each match.
print(p.findall(test_str))
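
A possible variant, sketched here as an illustration rather than as part of the original answer: Python does not require the slashes to be escaped, and the outer capturing group can be dropped if whole matches are collected with `finditer`. The sample string is made up and deliberately contains a bare `http` inside the first URL:

```python
import re

# Same idea without the outer capturing group or the escaped slashes.
# The negative lookahead only stops at a full "http://" or "https://",
# so a bare "http" inside a URL does not end the match.
pattern = re.compile(r'https?://(?:(?!https?://).)*')

sample = "https://a.example/one?u=httpfoohttps://b.example/two\n"
urls = [m.group(0) for m in pattern.finditer(sample)]
print(urls)  # ['https://a.example/one?u=httpfoo', 'https://b.example/two']
```

The trailing newline is left out of the last match because `.` does not match a newline by default.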
vks
  • This won't work if the URL string contains "http" in the middle. – mbomb007 Jan 15 '15 at 15:24
  • Yeah, I'd rather have the lookahead test for `http[s]?://` to make it a little more robust. I can't seem to work out how to add the `://` to your answer, though :S – GP89 Jan 15 '15 at 15:26
  • In the look ahead I mean – GP89 Jan 15 '15 at 15:27
  • Not in the later part – mbomb007 Jan 15 '15 at 15:27
  • `findall()` should be used to return matches, but your regex only captures each in a capturing group instead. Matching them with non-capturing groups would be faster than returning the groups and then converting that to a list, since obtaining groups still requires the use of at least a call of `match()`. See here: https://docs.python.org/2/library/re.html – mbomb007 Jan 15 '15 at 22:57