I'm looking for a regex that extracts URLs which are concatenated without a space or any other separator, while keeping "redirect" URLs (a URL passed as a query parameter of another URL) as one complete URL.

Let me show you an example:

http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz

should result in the following array:

['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']

I am able to separate the joined URLs thanks to this regex:

'~(?:https?:)?//.*?(?=$|(?:https?:)?//)~'

from this answer: Extract urls from string without spaces between

But I struggle to also keep the URLs containing =http in one piece.
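
To make the problem concrete, here is a minimal sketch in Python. The ~ delimiters suggest PHP, so running the pattern through Python's re module is only an assumption made for illustration; the pattern itself is unchanged apart from dropping the delimiters.

```python
import re

# The pattern from the question, without the PHP-style ~ delimiters.
pattern = r"(?:https?:)?//.*?(?=$|(?:https?:)?//)"
s = "http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz"

# No capturing groups, so findall returns the full text of each match.
matches = re.findall(pattern, s)
print(matches)
```

This prints four items instead of three: the redirect URL is cut into 'http://foo.bar?url=' and 'http://foo.baz', because the lookahead stops at every //, including the one after =.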

Thanks,

Guillaume Cisco

2 Answers

EDIT: for python

Use re.split with the regex (?<!=)(?<!^)(?=https?://).

It splits at the beginning of each new URL, unless that URL is preceded by = or is the first thing in the string (which avoids a redundant split at the beginning of the string).

>>> import re
>>> re.split(r'(?<!=)(?<!^)(?=https?://)', 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz')
['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']

Demo and explanation at regex101.
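
The same idea as a small reusable function, as a sketch only: the names URL_BOUNDARY and split_joined_urls are hypothetical and not part of any library. Splitting on a zero-width match like this one requires Python 3.7 or later.

```python
import re

# Zero-width boundary: a position followed by "http://" or "https://",
# but not at the start of the string and not directly after "=".
URL_BOUNDARY = re.compile(r"(?<!=)(?<!^)(?=https?://)")


def split_joined_urls(text):
    """Hypothetical helper: split concatenated URLs, keeping '=http...' redirects whole."""
    return URL_BOUNDARY.split(text)


urls = split_joined_urls(
    "http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz"
)
print(urls)
```

Note that this only treats a nested URL as a redirect when = comes immediately before it, exactly as the regex above does.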


Assuming you are using PHP (based on the regex provided in the question):

Use preg_split with a lookahead for https?:// and a negative lookbehind for =|^. The lookbehind prevents a split before a URL that is preceded by =, and also avoids a redundant split at the beginning of the line.

<?php
$keywords = preg_split("~(?<!=|^)(?=https?://)~", "http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz");
print_r($keywords);
?>

Outputs:

Array
(
    [0] => http://foo.bar
    [1] => https://foo.baz
    [2] => http://foo.bar?url=http://foo.baz
)

Online demo here.

Demo and explanation at regex101.

markalex

You could also update the pattern a bit and exclude a preceding = by adding a negative lookbehind inside the lookahead:

https?://\S*?(?=(?<!=)https?://|$)

Regex demo | Python demo

import re

pattern = r"https?://\S*?(?=(?<!=)https?://|$)"

s = ("http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz\n"
    "http://foo.bar?url=http://foo.baz?foo=bar&baz=quxhttp://")

matches = re.findall(pattern, s, re.MULTILINE)
print(matches)

Output

[
  'http://foo.bar',
  'https://foo.baz',
  'http://foo.bar?url=http://foo.baz',
  'http://foo.bar?url=http://foo.baz?foo=bar&baz=qux',
  'http://'
]
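
If the bare 'http://' in the last position is not wanted, one possible variation (an assumption on my part, not something the question asks for) is to require at least one character after the scheme by changing \S*? to \S+?. The variable name pattern_nonempty is only illustrative.

```python
import re

# Same pattern as above, but \S+? demands at least one non-space
# character after "http://" or "https://".
pattern_nonempty = r"https?://\S+?(?=(?<!=)https?://|$)"

s = ("http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz\n"
     "http://foo.bar?url=http://foo.baz?foo=bar&baz=quxhttp://")

# re.MULTILINE lets $ match at the end of each line, not only at the
# end of the whole string.
matches = re.findall(pattern_nonempty, s, re.MULTILINE)
print(matches)
```

With the same two-line input this yields the first four URLs and drops the trailing empty 'http://'.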
The fourth bird