Remove recurrent (overlapping) slashes from string

Question

I'm parsing URLs like this:

>>> from urllib.parse import urlparse
>>> urlparse('http://foo.bar/path/to/heaven')
ParseResult(scheme='http', netloc='foo.bar', path='/path/to/heaven', params='', query='', fragment='')

Suppose I have a URL with a malformed path containing repeated slashes, like this:

>>> x = urlparse('http://foo.bar/path/to/////foo///baz//bar')
>>> x
ParseResult(scheme='http', netloc='foo.bar', path='/path/to/////foo///baz//bar', params='', query='', fragment='')

As you can see, x.path still contains the repeated slashes. I'm trying to remove them, so I tried splitting, looping, and removing the empty strings like this:

>>> newpath = x.path.split('/')
['', 'path', 'to', '', '', '', '', 'foo', '', '', 'baz', '', 'bar']
>>> for i in newpath:
    if i == '':
        newpath.remove('')
>>> '/'.join(newpath)
'/path/to/foo/baz/bar'

This gives the desired output, but I think this solution is inefficient and trash. How can I do it better?

@jsb No, I read that regexes are slow and should be avoided where possible, which is why I hadn't thought of using them. – tofu, Aug 06 '20 at 22:17

score 2 · Accepted Answer · answered Aug 06 '20 at 22:16

This is what regular expressions are made for:

import regex as re  # third-party "regex" module, not the standard library "re"

url = "http://foo.bar/path/to/////foo///baz//bar"

rx = re.compile(r'(?:(?:http|ftp)s?://)(*SKIP)(*FAIL)|/+')
url = rx.sub('/', url)
print(url)

This yields

http://foo.bar/path/to/foo/baz/bar

See a demo on regex101.com. The only real difficulty is leaving the double forward slash after the protocol untouched, hence the newer regex module and (*SKIP)(*FAIL). You could achieve the same result with lookbehinds in the re module.
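For readers who would rather stay in the standard library, one way to sidestep the protocol problem altogether is to collapse slashes only in the parsed path. This is a minimal sketch, not the answer's own method; it assumes the input parses cleanly with urlparse, and collapse_path_slashes is a hypothetical helper name:

```python
import re
from urllib.parse import urlparse

# Runs of two or more slashes; compiled once so it can be reused.
_MULTI_SLASH = re.compile(r'/{2,}')

def collapse_path_slashes(url):
    """Collapse repeated slashes in the path component only (hypothetical helper)."""
    parts = urlparse(url)
    # ParseResult is a namedtuple, so _replace returns a modified copy.
    cleaned = parts._replace(path=_MULTI_SLASH.sub('/', parts.path))
    return cleaned.geturl()

print(collapse_path_slashes('http://foo.bar/path/to/////foo///baz//bar'))
# http://foo.bar/path/to/foo/baz/bar
```

Because only the path is rewritten, the scheme separator and any double slashes inside the query string are left alone.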

Do you think this is faster than @barmar's list comprehension answer, or is the difference negligible for millions of URLs? – tofu, Aug 06 '20 at 22:29

Andrej Kesely · Answer 2 · 2020-08-06T22:39:11.947

0

import re

s = 'http://foo.bar/path/to/////foo///baz//bar'

s = re.sub(r'(?<!:)/{2,}', '/', s)
print(s)

Prints:

http://foo.bar/path/to/foo/baz/bar

EDIT: Compiling regex:

import re

s = 'http://foo.bar/path/to/////foo///baz//bar'
r = re.compile(r'(?<!:)/{2,}')

s = r.sub('/', s)
print(s)

edited Aug 06 '20 at 22:39

answered Aug 06 '20 at 22:18

Do you think this is faster than @barmar's list comprehension and Jan's regex answer, or is the difference negligible for millions of URLs? – tofu Aug 06 '20 at 22:36
@tofu The only way to know is to test it. But compile the regex beforehand to speed things up (see my edit). – Andrej Kesely Aug 06 '20 at 22:38
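
A rough way to run that test is the standard library's timeit module. The sketch below is only illustrative: the function names are made up for this example, the two functions work on slightly different inputs (full URL versus already-parsed path), and the timings will vary by machine and Python version.

```python
import re
import timeit

url = 'http://foo.bar/path/to/////foo///baz//bar'
path = '/path/to/////foo///baz//bar'

# Precompiled pattern from the answer above.
compiled = re.compile(r'(?<!:)/{2,}')

def with_regex():
    # Regex approach, applied to the whole URL.
    return compiled.sub('/', url)

def with_comprehension():
    # List comprehension approach, applied to the path only.
    return '/' + '/'.join([s for s in path.split('/') if s != ''])

for fn in (with_regex, with_comprehension):
    seconds = timeit.timeit(fn, number=100_000)
    print(f'{fn.__name__}: {seconds:.3f}s for 100,000 calls')
```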

Barmar · Answer 3 · 2020-08-06T22:24:10.580

0

You shouldn't modify a list while you're iterating over it. See "Strange result when removing item from a list".
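
To see why, here is a small sketch that reruns the question's loop on the same path. Each removal shifts the remaining items left while the loop's position keeps advancing, so some empty strings are never visited:

```python
parts = '/path/to/////foo///baz//bar'.split('/')

# Removing inside the loop shortens the list under the iterator,
# so the loop ends before every '' has been removed.
for item in parts:
    if item == '':
        parts.remove('')

print(parts)
print('/'.join(parts))
```

On this input the joined result still contains repeated slashes, and the leading slash is gone.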

You can use a list comprehension to create a list without all the '' elements.

newpath = [s for s in x.path.split('/') if s != '']
'/'.join(newpath)
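
As a self-contained sketch of the same idea: joining only the non-empty segments drops the leading slash, so this version adds it back. The variable names are illustrative, and it assumes the path is absolute, as in the question.

```python
from urllib.parse import urlparse

x = urlparse('http://foo.bar/path/to/////foo///baz//bar')

# Keep only the non-empty segments between slashes.
segments = [s for s in x.path.split('/') if s != '']

# Joining the segments alone loses the leading slash, so restore it.
newpath = '/' + '/'.join(segments)
print(newpath)
# /path/to/foo/baz/bar
```

Note that a trailing slash in the original path is also dropped by this approach, which may or may not matter for a given site.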

edited Aug 06 '20 at 22:24

answered Aug 06 '20 at 22:19

Do you think this is faster than regex, or is the difference negligible for millions of URLs? – tofu Aug 06 '20 at 22:29
