How to join components of a path when you are constructing a URL in Python

Question

For example, I want to join a prefix path to resource paths like /js/foo.js.

I want the resulting path to be relative to the root of the server. In the above example if the prefix was "media" I would want the result to be /media/js/foo.js.

os.path.join does this really well, but how it joins paths is OS dependent. In this case I know I am targeting the web, not the local file system.

Is there a best alternative when you are working with paths you know will be used in URLs? Will os.path.join work well enough? Should I just roll my own?

`os.path.join` will not work. But simply joining by the `/` character should work in all cases -- `/` is the standard path separator in HTTP per the specification. — intgr, Nov 24 '09 at 22:15

score 241 · Answer 1 · edited May 31 '20 at 16:58


You can use urllib.parse.urljoin:

>>> from urllib.parse import urljoin
>>> urljoin('/media/path/', 'js/foo.js')
'/media/path/js/foo.js'

But beware:

>>> urljoin('/media/path', 'js/foo.js')
'/media/js/foo.js'
>>> urljoin('/media/path', '/js/foo.js')
'/js/foo.js'

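The two calls give different results because /js/foo.js begins with a slash, which signifies that it is already rooted at the website root, so urljoin discards the base path entirely. When the base has no trailing slash, as in '/media/path', its last segment (path) is treated like a file name and is replaced by the relative part.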
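If the goal is only to place a resource path beneath a prefix, one way to sidestep both pitfalls is to normalize the slashes before calling urljoin. The following is a minimal sketch rather than part of the original answer; the helper name join_under_prefix is hypothetical.

```python
from urllib.parse import urljoin


def join_under_prefix(prefix, path):
    """Hypothetical helper: resolve `path` beneath `prefix`.

    A trailing slash is forced onto the prefix so that urljoin keeps its
    last segment, and leading slashes are stripped from the path so that
    urljoin does not treat it as rooted at the site root.
    """
    base = prefix.rstrip('/') + '/'
    return urljoin(base, path.lstrip('/'))


print(join_under_prefix('/media', '/js/foo.js'))   # /media/js/foo.js
print(join_under_prefix('/media/', 'js/foo.js'))   # /media/js/foo.js
```

Note that urljoin still applies its other resolution rules (for example, a path that is itself a full URL replaces the base), so this only helps when the second argument is known to be a plain path.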

On Python 2, you have to do

from urlparse import urljoin

edited May 31 '20 at 16:58

Boris Verkhovskiy


answered Nov 24 '09 at 22:10

Ben James


So I have to strip off the leading "/" on /js/foo.js, but it seems that would be the case with os.path.join too. Requiring the slash after media means I have to do most of the work myself anyway. – amjoconn Nov 24 '09 at 22:16
Specifically, once I require that the prefix ends in / and that the target path can't begin with /, I might as well just concatenate. In this case I am not sure urljoin is really helping. – amjoconn Nov 24 '09 at 22:20
@amjoconn The advantage of using urlparse.urljoin is that it removes duplicate slashes between the joined parts of the url so you don't have to worry about manually checking these and you can just concentrate on adding / removing the slashes at the beginning or end of the resulting url. – Medhat Gayed Feb 25 '14 at 20:27
@MedhatGayed It isn't clear to me that `urljoin` ever removes '/'. If I call it with `urlparse.urljoin('/media/', '/js/foo.js')` the returned value is '/js/foo.js'. It removed all of media, not the duplicate '/'. In fact `urlparse.urljoin('/media//', 'js/foo.js')` actually returns '/media//js/foo.js', so no duplicate was removed. – amjoconn Jul 31 '14 at 11:26
urljoin has weird behavior if you are joining components that don't end in /: it strips the first component to its base and then joins the other args on. Not what I would expect. – Pete Apr 26 '15 at 04:51
Unfortunately `urljoin` is not for joining URLs. It is for resolving relative URLs as found in HTML documents, etc. – OrangeDog Aug 15 '16 at 10:27
`urljoin` is also limited to certain schemes (listed in the documentation), and will not do what you want for other schemes (just returns the second argument). – Sam Brightman Nov 27 '16 at 14:11
This is not a documented use case for `urljoin`. The purpose of this function is to _Construct a full (“absolute”) URL by combining a “base URL” (base) with another URL (url)._ See: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin – Andrew Palmer Sep 28 '17 at 14:01
`urljoin('/media/path', 'js/foo.js')` ---> `'/media/js/foo.js'`. WHAT? – richardsonwtr Feb 10 '23 at 13:37

score 87 · Accepted Answer · answered Nov 25 '09 at 04:05


Since, from the comments the OP posted, it seems he doesn't want to preserve "absolute URLs" in the join (which is one of the key jobs of urlparse.urljoin;-), I'd recommend avoiding that. os.path.join would also be bad, for exactly the same reason.

So, I'd use something like '/'.join(s.strip('/') for s in pieces) (if the leading / must also be ignored -- if the leading piece must be special-cased, that's also feasible of course;-).
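As a rough illustration of the strip-and-join idea for the question's case (a result rooted at the server), here is a small sketch. The function name join_url_path is hypothetical, and dropping empty pieces is an added assumption, not something the answer specifies.

```python
def join_url_path(*pieces):
    """Join pieces with single slashes and root the result at '/'.

    Every piece has its leading and trailing slashes stripped; pieces that
    end up empty (such as '' or '/') are skipped, which is an assumption
    made for this sketch.
    """
    stripped = (str(piece).strip('/') for piece in pieces)
    return '/' + '/'.join(s for s in stripped if s)


print(join_url_path('media', '/js/foo.js'))    # /media/js/foo.js
print(join_url_path('/media/', 'js/foo.js'))   # /media/js/foo.js
```

Because every piece is stripped, a trailing slash on the last piece is lost; special-casing the first or last piece would be needed to preserve it.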

answered Nov 25 '09 at 04:05

Alex Martelli

854,459
170
1,222
1,395

Thanks. I didn't mind so much requiring that the leading '/' on the second part couldn't be there, but requiring the trailing '/' on the first part made me feel as if in this use case urljoin wasn't doing anything for me. I would like at least join("/media", "js/foo.js") and join("/media/", "js/foo.js") to work. Thanks for what appears to be the right answer: roll your own. – amjoconn Nov 25 '09 at 14:42
I hoped something would do the '/' stripping and joining for me. – Cory Jun 25 '18 at 18:43
Nope, this is not going to work on Windows, where `os.path.join('http://media.com', 'content')` would return `http://media.com\content`. – SeF Mar 18 '20 at 11:16
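The platform dependence mentioned in the comment above can be reproduced on any OS, because the Windows and POSIX flavours of os.path are importable directly as ntpath and posixpath. This is only an illustrative sketch; the variable names are made up.

```python
import ntpath      # the module behind os.path on Windows
import posixpath   # the module behind os.path on POSIX systems

# Same arguments, different separator depending on the flavour used.
windows_style = ntpath.join('http://media.com', 'content')
posix_style = posixpath.join('http://media.com', 'content')

print(windows_style)  # uses a backslash between the two parts
print(posix_style)    # uses a forward slash between the two parts
```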

GP89 · Answer 3 · 2017-02-07T10:56:23.097

Like you say, os.path.join joins paths based on the current OS. posixpath is the underlying module that is used on POSIX systems under the namespace os.path:

>>> import os, posixpath
>>> os.path.join is posixpath.join  # True only when running on a POSIX system
True
>>> posixpath.join('/media/', 'js/foo.js')
'/media/js/foo.js'

So you can just import and use posixpath.join instead for urls, which is available and will work on any platform.
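One caveat worth keeping in mind: like os.path.join, posixpath.join throws away everything before a component that starts with a slash. The sketch below shows that behaviour and one possible workaround; the wrapper name join_posix is hypothetical.

```python
import posixpath


def join_posix(prefix, path):
    """Hypothetical wrapper: strip leading slashes so the prefix survives."""
    return posixpath.join(prefix, path.lstrip('/'))


# An "absolute" second component discards the prefix entirely.
print(posixpath.join('/media', '/js/foo.js'))   # /js/foo.js

# Stripping the leading slash first keeps the prefix.
print(join_posix('/media', '/js/foo.js'))       # /media/js/foo.js
```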

Edit: @Pete's suggestion is a good one, you can alias the import for increased readability

from posixpath import join as urljoin

Edit: I think this is made clearer, or at least it helped me understand, if you look into the source of os.py (the code here is from Python 2.7.11, and I've trimmed some bits). There are conditional imports in os.py that pick which path module to use in the namespace os.path. All the underlying modules (posixpath, ntpath, os2emxpath, riscospath) that may be imported in os.py, aliased as path, exist and can be used on all systems. os.py just picks one of the modules to use in the namespace os.path at run time based on the current OS.

# os.py
import sys, errno

_names = sys.builtin_module_names

if 'posix' in _names:
    # ...
    from posix import *
    # ...
    import posixpath as path
    # ...

elif 'nt' in _names:
    # ...
    from nt import *
    # ...
    import ntpath as path
    # ...

elif 'os2' in _names:
    # ...
    from os2 import *
    # ...
    if sys.version.find('EMX GCC') == -1:
        import ntpath as path
    else:
        import os2emxpath as path
        from _emx_link import link
    # ...

elif 'ce' in _names:
    # ...
    from ce import *
    # ...
    # We can use the standard Windows path.
    import ntpath as path

elif 'riscos' in _names:
    # ...
    from riscos import *
    # ...
    import riscospath as path
    # ...

else:
    raise ImportError, 'no os specific module found'

`from posixpath import join as urljoin` nicely aliases it to something easy to read. — Pete, Apr 26 '15 at 04:49

Rune Kaagaard · Answer 4 · 2018-08-27T08:06:22.607

39

This does the job nicely:

def urljoin(*args):
    """
    Joins given arguments into an url. Trailing but not leading slashes are
    stripped for each argument.
    """

    return "/".join(map(lambda x: str(x).rstrip('/'), args))
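To make the "trailing but not leading" behaviour concrete, here is a short usage sketch that repeats the function so it can be run on its own. The variable names are illustrative only.

```python
def urljoin(*args):
    """Same function as above: only trailing slashes are stripped."""
    return "/".join(map(lambda x: str(x).rstrip('/'), args))


# A trailing slash on the prefix is handled.
joined_clean = urljoin('/media/', 'js/foo.js')

# A leading slash on a later part is kept, producing a double slash.
joined_double = urljoin('/media', '/js/foo.js')

print(joined_clean)    # /media/js/foo.js
print(joined_double)   # /media//js/foo.js
```

So this version works well when later parts never start with a slash; otherwise the leading slashes need to be stripped as well.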

edited Aug 27 '18 at 08:06

answered Jul 04 '12 at 09:28

Rune Kaagaard


score 12 · Answer 5 · answered Sep 21 '19 at 04:52

I found things not to like about all the above solutions, so I came up with my own. This version makes sure parts are joined with a single slash and leaves leading and trailing slashes alone. No pip install, no urllib.parse.urljoin weirdness.

In [1]: from functools import reduce

In [2]: def join_slash(a, b):
   ...:     return a.rstrip('/') + '/' + b.lstrip('/')
   ...:

In [3]: def urljoin(*args):
   ...:     return reduce(join_slash, args) if args else ''
   ...:

In [4]: parts = ['https://foo-bar.quux.net', '/foo', 'bar', '/bat/', '/quux/']

In [5]: urljoin(*parts)
Out[5]: 'https://foo-bar.quux.net/foo/bar/bat/quux/'

In [6]: urljoin('https://quux.com/', '/path', 'to/file///', '//here/')
Out[6]: 'https://quux.com/path/to/file/here/'

In [7]: urljoin()
Out[7]: ''

In [8]: urljoin('//','beware', 'of/this///')
Out[8]: '/beware/of/this///'

In [9]: urljoin('/leading', 'and/', '/trailing/', 'slash/')
Out[9]: '/leading/and/trailing/slash/'

I am always happy when a solution involves functools like reduce — Siddharth Pant, Jun 18 '21 at 11:02

mwcz · Answer 6 · 2009-11-24T22:18:34.550

10

The basejoin function in the urllib package might be what you're looking for.

basejoin = urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter.

Edit: I didn't notice before, but urllib.basejoin seems to map directly to urlparse.urljoin, making the latter preferred.

edited Nov 24 '09 at 22:18

answered Nov 24 '09 at 22:10

mwcz


score 9 · Answer 7 · answered Oct 04 '17 at 13:39


Using furl (pip install furl), it would be:

import furl
furl.furl('/media/path/').add(path='js/foo.js')

answered Oct 04 '17 at 13:39

Vasili Pascal

If you want the result to be a string you can add `.url` at the end: `furl.furl('/media/path/').add(path='js/foo.js').url` – Eyal Levin Oct 31 '17 at 12:15
furl works better at joining URLs than urlparse.urljoin, in Python 2 at least. – Ciasto piekarz Jan 04 '18 at 04:00
It's better to do `furl('/media/path/').add(path=furl('/js/foo.js').path).url` because `furl('/media/path/').add(path='/js/foo.js').url` is `/media/path//js/foo.js` – bartolo-otrit Jan 24 '19 at 09:37

score 5 · Answer 8 · answered Mar 22 '15 at 17:19

I know this is a bit more than the OP asked for. However, I had the pieces of the following URL and was looking for a simple way to join them:

>>> url = 'https://api.foo.com/orders/bartag?spamStatus=awaiting_spam&page=1&pageSize=250'

Doing some looking around:

>>> import urlparse  # Python 2; on Python 3 the module is urllib.parse
>>> split = urlparse.urlsplit(url)
>>> split
SplitResult(scheme='https', netloc='api.foo.com', path='/orders/bartag', query='spamStatus=awaiting_spam&page=1&pageSize=250', fragment='')
>>> type(split)
<class 'urlparse.SplitResult'>
>>> dir(split)
['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__getstate__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_asdict', '_fields', '_make', '_replace', 'count', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'password', 'path', 'port', 'query', 'scheme', 'username']
>>> split[0]
'https'
>>> split = (split[:])
>>> type(split)
<type 'tuple'>

So, in addition to the path joining that has already been covered in the other answers, to get what I was looking for I did the following:

>>> split
('https', 'api.foo.com', '/orders/bartag', 'spamStatus=awaiting_spam&page=1&pageSize=250', '')
>>> unsplit = urlparse.urlunsplit(split)
>>> unsplit
'https://api.foo.com/orders/bartag?spamStatus=awaiting_spam&page=1&pageSize=250'

According to the documentation, urlunsplit takes exactly five parts, in the following tuple format:

Attribute  Index  Value                   Value if not present
scheme     0      URL scheme specifier    empty string
netloc     1      Network location part   empty string
path       2      Hierarchical path       empty string
query      3      Query component         empty string
fragment   4      Fragment identifier     empty string
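For reference, the same round trip on Python 3 might look like the sketch below, which also swaps in a different path before reassembling. The replacement path '/orders/other' is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

url = 'https://api.foo.com/orders/bartag?spamStatus=awaiting_spam&page=1&pageSize=250'

# SplitResult is a named tuple, so _replace returns a modified copy.
parts = urlsplit(url)
new_parts = parts._replace(path='/orders/other')  # hypothetical new path

# urlunsplit accepts the five components and rebuilds the URL string.
rebuilt = urlunsplit(new_parts)
print(rebuilt)
# https://api.foo.com/orders/other?spamStatus=awaiting_spam&page=1&pageSize=250
```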

score 5 · Answer 9 · answered Apr 11 '19 at 20:51


Rune Kaagaard provided a great and compact solution that worked for me, I expanded on it a little:

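def urljoin(*args):
    # Convert to str up front so non-string parts (e.g. ints) work too.
    parts = [str(x) for x in args]
    # Keep a trailing slash only if the last part had one.
    trailing_slash = '/' if parts and parts[-1].endswith('/') else ''
    return "/".join(p.strip('/') for p in parts) + trailing_slash

This allows all arguments to be joined regardless of leading and trailing slashes, while preserving the final slash if present.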

answered Apr 11 '19 at 20:51

futuere

You can make that last line a little shorter and more Pythonic by using a list comprehension, like: `return "/".join([str(x).strip("/") for x in args]) + trailing_slash` – Dan Coates Jun 06 '20 at 20:47

score 3 · Answer 10 · answered Sep 22 '17 at 09:00

To improve slightly on Alex Martelli's response, the following will not only clean up extra slashes but also preserve a trailing (ending) slash, which can sometimes be useful:

>>> items = ["http://www.website.com", "/api", "v2/"]
>>> url = "/".join([(u.strip("/") if index + 1 < len(items) else u.lstrip("/")) for index, u in enumerate(items)])
>>> print(url)
http://www.website.com/api/v2/

It's not as easy to read, though, and it won't clean up multiple extra trailing slashes.
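For readability, the same logic can be unpacked into a small function. This is a sketch with a hypothetical name (join_keep_trailing) that is meant to behave like the one-liner above, including leaving extra trailing slashes on the last item untouched.

```python
def join_keep_trailing(items):
    """Join URL parts, keeping whatever trailing slash the last part has."""
    if not items:
        return ''
    *head, last = items
    # Every part except the last loses both leading and trailing slashes.
    cleaned = [part.strip('/') for part in head]
    # The last part only loses its leading slashes.
    cleaned.append(last.lstrip('/'))
    return '/'.join(cleaned)


print(join_keep_trailing(["http://www.website.com", "/api", "v2/"]))
# http://www.website.com/api/v2/
```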

Andrew · Answer 11 · 2021-02-16T17:07:32.850

How about this? It is fairly efficient and fairly simple, and it is enough when you only need to join two parts of a URL path:

def UrlJoin(a, b):
    a, b = a.strip(), b.strip()
    a = a if a.endswith('/') else a + '/'
    b = b if not b.startswith('/') else b[1:]
    return a + b

Or, more conventionally, but less efficient if you are only joining two parts of a path:

def UrlJoin(*parts):
    return '/'.join([p.strip().strip('/') for p in parts])

Test Cases:

>>> UrlJoin('https://example.com/', '/TestURL_1')
'https://example.com/TestURL_1'

>>> UrlJoin('https://example.com', 'TestURL_2')
'https://example.com/TestURL_2'

Note: I may be splitting hairs here, but it is at least good practice and potentially more readable.

score 1 · Answer 12 · edited Sep 21 '19 at 03:57

Using furl and regex (python 3)

>>> import re
>>> import furl
>>> p = re.compile(r'(\/)+')
>>> url = furl.furl('/media/path').add(path='/js/foo.js').url
>>> url
'/media/path/js/foo.js'
>>> p.sub(r"\1", url)
'/media/path/js/foo.js'
>>> url = furl.furl('/media/path').add(path='js/foo.js').url
>>> url
'/media/path/js/foo.js'
>>> p.sub(r"\1", url)
'/media/path/js/foo.js'
>>> url = furl.furl('/media/path/').add(path='js/foo.js').url
>>> url
'/media/path/js/foo.js'
>>> p.sub(r"\1", url)
'/media/path/js/foo.js'
>>> url = furl.furl('/media///path///').add(path='//js///foo.js').url
>>> url
'/media///path/////js///foo.js'
>>> p.sub(r"\1", url)
'/media/path/js/foo.js'

score 1 · Answer 13 · answered Aug 06 '21 at 13:15


One liner:

from functools import reduce
reduce(lambda x,y: '{}/{}'.format(x,y), parts)

where parts is, e.g., ['https://api.somecompany.com/v1', 'weather', 'rain']
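A quick sketch of what this one-liner does and does not do: for a non-empty list of strings it gives the same result as '/'.join(parts), and it does not strip any slashes. The second list below is a made-up example to show the resulting double slash.

```python
from functools import reduce

parts = ['https://api.somecompany.com/v1', 'weather', 'rain']
joined = reduce(lambda x, y: '{}/{}'.format(x, y), parts)
print(joined)  # https://api.somecompany.com/v1/weather/rain

# For strings this is equivalent to a plain join.
assert joined == '/'.join(parts)

# Slashes already present in the parts are kept, so they can double up.
messy_parts = ['https://api.somecompany.com/v1/', '/weather']  # hypothetical input
messy = reduce(lambda x, y: '{}/{}'.format(x, y), messy_parts)
print(messy)   # https://api.somecompany.com/v1///weather
```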

answered Aug 06 '21 at 13:15

Arindam Roychowdhury


score 1 · Answer 14 · answered Jan 17 '23 at 06:59

Here's a safe version I'm using. It takes care of leading and trailing slashes. The trailing slash of the last URI is handled separately.

def safe_urljoin(*uris) -> str:
    """
    Joins the URIs carefully considering the prefixes and trailing slashes.
    The trailing slash for the end URI is handled separately.
    """
    if len(uris) == 1:
        return uris[0]

    safe_urls = [
        f"{url.lstrip('/')}/" if not url.endswith("/") else url.lstrip("/")
        for url in uris[:-1]
    ]
    safe_urls.append(uris[-1].lstrip("/"))
    return "".join(safe_urls)

The output:

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "/left")
'https://a.com/adunits/both/left'

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "right/")
'https://a.com/adunits/both/right/'

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "right/", "none")
'https://a.com/adunits/both/right/none'

>>> safe_urljoin("https://a.com/", "adunits/", "/both/", "right/", "none/")
'https://a.com/adunits/both/right/none/'

score 0 · Answer 15 · answered May 19 '22 at 10:23

Yet another variation with unique features:

def urljoin(base:str, *parts:str) -> str:
    for part in filter(None, parts):
        base = '{}/{}'.format(base.rstrip('/'), part.lstrip('/'))
    return base

Preserve trailing slash in base or last part
Empty parts are ignored
For each non-empty part, remove trailing from base and leading from part and join with a single /

urljoin('http://a.com/api',  '')  -> 'http://a.com/api'
urljoin('http://a.com/api',  '/') -> 'http://a.com/api/'
urljoin('http://a.com/api/', '')  -> 'http://a.com/api/'
urljoin('http://a.com/api/', '/') -> 'http://a.com/api/'
urljoin('http://a.com/api/', '/a/', '/b', 'c', 'd/') -> 'http://a.com/api/a/b/c/d/'

Zio · Answer 16 · 2022-09-03T11:38:00.077

Ok, that's what I did, because I needed complete independence from predefined roots:

def url_join(base: str, *components: str, slash_left=True, slash_right=True) -> str:
    """Join two or more url components, inserting '/' as needed.
    Optionally, a slash can be added to the left or right side of the URL.
    """
    base = base.strip('/')
    components = [component.strip('/') for component in components]
    url = f"/{base}" if slash_left else base
    for component in components:
        url = f"{url}/{component}"
    return f"{url}/" if slash_right else url

url_join("http://whoops.io", "foo/", "/bar", "foo", slash_left=False)
# "http://whoops.io/foo/bar/foo/"
url_join("foo", "bar")
# "/foo/bar/"
