Extract joined urls but not if redirect exists

Question

I'm looking for a regex that extracts URLs when they are not separated by a space or any other delimiter, but keeps the "redirect" ones as a complete URL.

Let me show you an example:

http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz

should result in the following array:

['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']

I am able to separate the joined URLs thanks to this regex:

'~(?:https?:)?//.*?(?=$|(?:https?:)?//)~'

from this answer: Extract urls from string without spaces between

But I struggle to also keep the redirect URLs intact, i.e. to avoid splitting at =http.

Thanks,

Use a negative lookbehind that matches `=` to keep it from breaking those up. — Barmar, May 09 '23 at 20:21
Why do you have `~` at the beginning and end? Are you using PHP so it requires a delimiter around it? — Barmar, May 09 '23 at 20:22
How would you split `http://foo.bar?url=http://foo.baz?foo=bar&baz=quxhttp://`? — InSync, May 09 '23 at 20:51
`['http://foo.bar?url=http://foo.baz?foo=bar&baz=qux', 'http://']` but in the redirect url, special characters should be encoded of course — Guillaume Cisco, May 09 '23 at 20:52

markalex · Accepted Answer · 2023-05-09T20:57:31.997

2

EDIT: for python

Use re.split with the regex (?<!=)(?<!^)(?=https?://).

It splits at the beginning of each new URL, unless that URL is preceded by = or is the first one in the string (the latter avoids a redundant split at the very beginning).

>>> re.split(r'(?<!=)(?<!^)(?=https?://)', 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz')
['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']

Demo and explanation at regex101.
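
To make the reasoning above easier to check, here is a minimal sketch that wraps the split in a function and also shows why the lookbehind is written as two separate groups in Python. The helper names `split_joined_urls` and `compiles` are made up for illustration. It assumes Python 3.7 or later, since `re.split` only splits on empty matches from that version on.

```python
import re

# Two fixed-width lookbehinds: one rejects a preceding '=',
# the other rejects the start of the string.
SPLIT_PATTERN = re.compile(r'(?<!=)(?<!^)(?=https?://)')


def split_joined_urls(text):
    """Split concatenated URLs, keeping '=http...' redirect targets attached."""
    return SPLIT_PATTERN.split(text)


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The combined lookbehind mixes a 1-character branch ('=') with a
# 0-character branch ('^'), which Python's re rejects as variable-width.
print(compiles(r'(?<!=|^)(?=https?://)'))  # False

print(split_joined_urls(
    'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz'))
# ['http://foo.bar', 'https://foo.baz', 'http://foo.bar?url=http://foo.baz']
```

Note that without re.MULTILINE, `^` only matches at the start of the whole string, so this sketch is meant for a single line of joined URLs.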

Assuming (based on regex provided in question) you are using PHP:

Use preg_split with a lookahead for https?:// and a negative lookbehind for =|^, so that it does not split before a URL preceded by = and does not make a redundant split at the beginning of the line.

<?php
$keywords = preg_split("~(?<!=|^)(?=https?://)~", "http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz");
print_r($keywords);
?>

Outputs:

Array
(
    [0] => http://foo.bar
    [1] => https://foo.baz
    [2] => http://foo.bar?url=http://foo.baz
)

Online demo here.

Demo and explanation at regex101.

edited May 09 '23 at 20:57

answered May 09 '23 at 20:25


I tried to use it with python: re.findall('(?<!=|^)(?=https?://)', 'http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz') but got an error re.error: look-behind requires fixed-width pattern – Guillaume Cisco May 09 '23 at 20:44

@GuillaumeCisco, updated for python. – markalex May 09 '23 at 20:56
Yes with 2 separate lookbehinds it will work in Python – The fourth bird May 09 '23 at 20:56

@Thefourthbird, I know. The first version of the answer was based on the assumption that this is for PHP. I was too eager to answer before clearing up all the details :( – markalex May 09 '23 at 20:59

score 2 · Answer 2 · answered May 09 '23 at 20:59

You could also update the pattern a bit so that the lookahead excludes a scheme preceded by =, using a negative lookbehind inside it:

https?://\S*?(?=(?<!=)https?://|$)

Regex demo | Python demo

import re

pattern = r"https?://\S*?(?=(?<!=)https?://|$)"

s = ("http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz\n"
    "http://foo.bar?url=http://foo.baz?foo=bar&baz=quxhttp://")

matches = re.findall(pattern, s, re.MULTILINE)
print(matches)

Output

[
  'http://foo.bar',
  'https://foo.baz',
  'http://foo.bar?url=http://foo.baz',
  'http://foo.bar?url=http://foo.baz?foo=bar&baz=qux',
  'http://'
]
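
For readers who want to see what each part of that pattern does, the same regex can be spelled out with re.VERBOSE. This is only an annotated sketch of the pattern above, not a different technique; the name `URL_PATTERN` is made up for illustration.

```python
import re

URL_PATTERN = re.compile(r"""
    https?://             # scheme that starts each URL
    \S*?                  # as few non-whitespace characters as possible
    (?=                   # stop right before either...
        (?<!=)https?://   # ...a new scheme that is not preceded by '='
        |
        $                 # ...or the end of a line (because of re.MULTILINE)
    )
""", re.VERBOSE | re.MULTILINE)

s = ("http://foo.barhttps://foo.bazhttp://foo.bar?url=http://foo.baz\n"
     "http://foo.bar?url=http://foo.baz?foo=bar&baz=quxhttp://")

for url in URL_PATTERN.findall(s):
    print(url)
```

Because the `=` check sits inside the lookahead, a scheme directly after `=` never ends the current match, so the redirect target stays part of the URL that contains it.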
