Python: confusions with urljoin

Question

I am trying to form URLs from different pieces, and having trouble understanding the behavior of this method. For example:

Python 3.x

from urllib.parse import urljoin

>>> urljoin('some', 'thing')
'thing'
>>> urljoin('http://some', 'thing')
'http://some/thing'
>>> urljoin('http://some/more', 'thing')
'http://some/thing'
>>> urljoin('http://some/more/', 'thing') # just a tad / after 'more'
'http://some/more/thing'
>>> urljoin('http://some/more/', '/thing')
'http://some/thing'

Can you explain the exact behavior of this method?

Note to those coming across this question: the above import statement is for Python 3.x. Use `from urlparse import urljoin` for Python 2.x. – Joe J, Feb 26 '14 at 03:20

score 138 · Accepted Answer · edited May 13 '20 at 18:25

The best way (for me) to think of this is that the first argument, base, is like the page you are on in your browser. The second argument, url, is the href of an anchor on that page. The result is the final URL you will be taken to if you click the link.

>>> urljoin('some', 'thing')
'thing'

This one makes sense given my description, though one would hope base includes a scheme and domain.

>>> urljoin('http://some', 'thing')
'http://some/thing'

If you are on a vhost some, and there is an anchor like <a href='thing'>Foo</a>, then the link will take you to http://some/thing.

>>> urljoin('http://some/more', 'thing')
'http://some/thing'

We are on some/more here. The last segment (more) is treated as a page rather than a directory, so a relative link to thing replaces it and takes us to http://some/thing.

>>> urljoin('http://some/more/', 'thing') # just a tad / after 'more'
'http://some/more/thing'

Here, we aren't on some/more; we are on some/more/, which is different. Now our relative link will take us to some/more/thing.

>>> urljoin('http://some/more/', '/thing')
'http://some/thing'

And lastly, if we are on some/more/ and the href is /thing, the leading slash means the link is resolved from the site root, so you will be taken to some/thing.
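The browser analogy above can be checked mechanically. This is only a minimal sketch, not part of the original answer: the helper name `click` and the `examples` list are made up for illustration, and the expected values are the outputs shown in the question.

```python
from urllib.parse import urljoin


def click(page, href):
    """Hypothetical helper: where a browser sitting on `page` ends up
    after following a link whose href is `href`."""
    return urljoin(page, href)


# (page you are on, href of the link, where you end up)
examples = [
    ('some', 'thing', 'thing'),
    ('http://some', 'thing', 'http://some/thing'),
    ('http://some/more', 'thing', 'http://some/thing'),
    ('http://some/more/', 'thing', 'http://some/more/thing'),
    ('http://some/more/', '/thing', 'http://some/thing'),
]

for page, href, expected in examples:
    assert click(page, href) == expected
    print(f"{page!r} + {href!r} -> {expected!r}")
```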

edited May 13 '20 at 18:25 by Antony Hatchkins · answered Jun 05 '12 at 07:39 by sberry

Thanks for explaining... this kind of behaviour makes me look for a 'true' `urljoin`, acting similar to `os.path.join` – Evgeny Sep 27 '17 at 17:58
For those who also just want to add one bit of url onto another, without urljoin's logic, posixpath.join() may work for you. – Harabeck Dec 08 '20 at 15:07
I like how `urljoin('http://', 'some/', 'thing')` ends up with `'http:///some/'` ¯\_(ツ)_/¯ – seb Jun 17 '21 at 09:33
@seb urljoin is not variadic, the third parameter is a boolean flag – misterManager Jun 24 '21 at 15:33
Thanks, I see. But still, the "///" is weird. – seb Jun 25 '21 at 18:43
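For readers who, like the commenters above, want `os.path.join`-style appending rather than link resolution, here is one possible sketch built on `posixpath.join`. The helper name `append_path` is hypothetical, and stripping slashes from each segment is an assumption made so that a leading "/" does not reset the path.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def append_path(base, *segments):
    """Hypothetical helper: append path segments to the path of `base`,
    whether or not `base` ends with a slash."""
    parts = urlsplit(base)
    # Strip slashes so posixpath.join never sees an "absolute" segment,
    # which would make it discard everything before that segment.
    cleaned = [segment.strip('/') for segment in segments]
    # An empty base path (e.g. 'http://some') is treated as the root.
    new_path = posixpath.join(parts.path or '/', *cleaned)
    return urlunsplit(parts._replace(path=new_path))


print(append_path('http://some/more', 'thing'))
print(append_path('http://some', 'more', '/thing'))
```

Both calls give 'http://some/more/thing'. Unlike urljoin, this sketch never replaces the last segment of base, and it does not accept more than a path in each segment.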

Bar Horing · Answer 2 · 2018-07-29T07:24:16.227

urllib.parse.urljoin(base, url)

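If url carries its own host name (that is, it starts with a scheme such as http:// or https://, or with // for a scheme-relative URL), the url’s host name is used in the result. Its scheme is used too when it has one; a url starting with // takes its scheme from base. For example: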

>>> urljoin('https://www.google.com', '//www.microsoft.com')
'https://www.microsoft.com'

Otherwise, as the documentation puts it, urllib.parse.urljoin(base, url) will

Construct a full (“absolute”) URL by combining a “base URL” (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.

>>> from urllib.parse import urljoin, urlparse
>>> urlparse('http://a/b/c/d/e')
ParseResult(scheme='http', netloc='a', path='/b/c/d/e', params='', query='', fragment='')
>>> urljoin('http://a/b/c/d/e', 'f')
'http://a/b/c/d/f'
>>> urlparse('http://a/b/c/d/e/')
ParseResult(scheme='http', netloc='a', path='/b/c/d/e/', params='', query='', fragment='')
>>> urljoin('http://a/b/c/d/e/', 'f')
'http://a/b/c/d/e/f'

It grabs the path of the first parameter (base), strips everything after the last /, and joins the result with the second parameter (url).
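The strip-and-join step just described can be written out by hand. The function `naive_join` below is a hypothetical, deliberately simplified sketch: it assumes url is a plain relative path (no scheme, no leading slash, no "..", no query or fragment), so it is not a replacement for urljoin.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def naive_join(base, rel):
    """Hypothetical re-creation of the simple relative-path case only."""
    parts = urlsplit(base)
    # Keep everything before the last '/' of the base path.
    directory = parts.path.rpartition('/')[0]
    new_path = directory + '/' + rel
    # Scheme and netloc come from base; query and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, new_path, '', ''))


for base in ('http://a/b/c/d/e', 'http://a/b/c/d/e/'):
    # For these inputs the hand-written version agrees with urljoin.
    assert naive_join(base, 'f') == urljoin(base, 'f')
    print(naive_join(base, 'f'))
```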

If url starts with /, it joins the scheme and netloc of base with url:

>>> urljoin('http://a/b/c/d/e', '/f')
'http://a/f'
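The cases covered in this answer can be put side by side. The sketch below is illustrative only: the base URL and the `cases` mapping are made-up examples, with each key showing one form that url can take.

```python
from urllib.parse import urljoin

base = 'https://www.google.com/search/help'

# form of url -> expected result of urljoin(base, url)
cases = {
    # Has its own scheme and host: base is ignored.
    'http://example.org/x': 'http://example.org/x',
    # Starts with //: own host, scheme taken from base.
    '//www.microsoft.com/x': 'https://www.microsoft.com/x',
    # Starts with /: scheme and host from base, path replaced.
    '/f': 'https://www.google.com/f',
    # Plain relative path: replaces the last segment of the base path.
    'f': 'https://www.google.com/search/f',
}

for url, expected in cases.items():
    assert urljoin(base, url) == expected
    print(f"{url!r:26} -> {expected!r}")
```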

score 1 · Answer 3 · answered Aug 02 '23 at 16:10

A picture is worth a thousand words.

$ python3
Python 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from urllib.parse import urljoin
>>> urljoin("http://a/b", "c/d")
'http://a/c/d'
>>> urljoin("http://a/b", "/c/d")
'http://a/c/d'
>>> urljoin("http://a/b/", "c/d")
'http://a/b/c/d'
>>> urljoin("http://a/b/", "/c/d")
'http://a/c/d'

The best practice, if you want url to be appended beneath base, is:

Give the "base" parameter a trailing slash ("/"), and do not start the "url" parameter with a slash ("/").
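One way to apply that rule without relying on callers to remember it is a small wrapper. This is a sketch under assumptions: the name `join_under` is hypothetical, and it assumes base has no query string or fragment, since a slash would otherwise be appended after them.

```python
from urllib.parse import urljoin


def join_under(base, url):
    """Hypothetical wrapper: always resolve `url` beneath `base`."""
    if not base.endswith('/'):
        base += '/'  # treat the last segment of base as a directory
    # Drop leading slashes so url is not resolved from the site root.
    return urljoin(base, url.lstrip('/'))


print(join_under('http://a/b', 'c/d'))
print(join_under('http://a/b/', '/c/d'))
```

Both calls give 'http://a/b/c/d', whereas plain urljoin gives 'http://a/c/d' for the same two pairs of arguments.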
