Problem while joining two URL components with urllib

Question

I am writing a Python program that crawls a website. I want to join two URL components with urllib.parse.urljoin so that I get the following result:

https://test.com/endpoint + test.php =  https://test.com/endpoint/test.php

My code:

import urllib.parse
urllib.parse.urljoin('https://test.com/endpoint', 'test.php')

However, it is showing the following output:

https://test.com/test.php

Is there a way to get my desired output?

How exactly are you "using `urllib.parse.urljoin`"? What code produces this output? — ForceBru, May 16 '21 at 15:46
a = "https://test.com/endpoint" b = "test.php" urllib.parse.urljoin(a,b) — Faiyaz Ahmad, May 16 '21 at 15:48
How about appending a `/` to the base url before doing `urljoin`? e.g. `urljoin('https://test.com/endpoint' + '/', 'test.php')` — Masood Khaari, Sep 06 '22 at 12:13

Rivers · Answer 1 · 2023-07-10T10:49:16.280

That's because urllib.parse.urljoin is not made for this use case.

Example from the docs (https://docs.python.org/fr/3/library/urllib.parse.html#module-urllib.parse):

from urllib.parse import urljoin

new_url = urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
print(new_url)

Output:

http://www.cwi.nl/%7Eguido/FAQ.html

As written in the doc, urllib.parse.urljoin constructs

a full ("absolute") URL by combining a "base URL" (base) with another URL (url).

In your example, the first argument is "https://test.com/endpoint", with no trailing slash. urllib.parse.urljoin therefore treats "endpoint" as the last segment of the path (like a file name) and resolves the second argument relative to "https://test.com/". The last segment is replaced by "test.php", which is why your output is "https://test.com/test.php".
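To make the rule above concrete, here is a small sketch using the URLs from the question. It only shows how the trailing slash on the base changes the result; it is not a recommendation to rely on urljoin for this.

```python
from urllib.parse import urljoin

# No trailing slash: "endpoint" is the last path segment and gets replaced.
no_slash = urljoin("https://test.com/endpoint", "test.php")

# Trailing slash: "endpoint/" is treated like a directory and is kept.
with_slash = urljoin("https://test.com/endpoint/", "test.php")

# A leading slash on the second argument resets the path to the site root.
rooted = urljoin("https://test.com/endpoint/", "/test.php")

print(no_slash)    # https://test.com/test.php
print(with_slash)  # https://test.com/endpoint/test.php
print(rooted)      # https://test.com/test.php
```

So appending a slash to the base works only as long as the second component never starts with a slash.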

I think your best option is to use the joinurl function posted by @tripleee, because it will not produce results like "endpoint//test.php" or "endpointtest.php".

But you should not use os.path.join if your code has to be cross-platform. On Windows, it joins with a backslash instead of a slash ("https://test.com/endpoint\test.php").
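If you still want os.path.join-style behaviour, one possibility (my own suggestion, not something the urllib docs recommend for URLs) is posixpath.join, the POSIX flavour of os.path.join, which uses a forward slash on every platform. A minimal sketch, including a caveat in the second call:

```python
import posixpath

# posixpath.join always joins with "/", even on Windows.
joined = posixpath.join("https://test.com/endpoint", "test.php")
print(joined)  # https://test.com/endpoint/test.php

# Caveat: a component that starts with "/" is treated as an absolute path,
# so everything before it is discarded.
reset = posixpath.join("https://test.com/endpoint", "/test.php")
print(reset)  # /test.php
```

Because of that second case, a function that strips the slashes itself is the more forgiving choice.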

Here is a code sample for testing purposes:

def joinurl(baseurl, path):
    # Strip the slashes at the junction, then join with exactly one slash.
    return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])

desired_output = "https://test.com/endpoint/test.php"

# All four combinations of trailing and leading slashes give the same result.
assert joinurl("https://test.com/endpoint", "test.php") == desired_output
assert joinurl("https://test.com/endpoint/", "test.php") == desired_output
assert joinurl("https://test.com/endpoint", "/test.php") == desired_output
assert joinurl("https://test.com/endpoint/", "/test.php") == desired_output

tripleee · Answer 2 · 2021-05-16T16:51:59.110

urljoin resolves its second argument relative to the base URL, which, when the base does not end in a slash, means replacing the last part of the path in the base URL. If that's not what you want, use a different function. Regular string joining would work well here, perhaps with a provision for normalizing slashes.

def joinurl(baseurl, path):
    return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])

This is rather similar to os.path.join; maybe consider using that instead. (Of course, on Windows, where the system path separator is not a slash, it will do the wrong thing for URLs.)
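If you need to join more than two components, the same idea extends naturally. The following is only a sketch; join_url_parts is a hypothetical name, and it assumes you never want to keep a trailing slash on the final segment.

```python
def join_url_parts(baseurl, *parts):
    """Join a base URL and any number of path segments with single slashes."""
    segments = [baseurl.rstrip('/')]
    # Strip slashes on both sides of each segment and skip empty ones,
    # so inputs like "" or "/" cannot produce "//" in the result.
    segments += [p.strip('/') for p in parts if p.strip('/')]
    return '/'.join(segments)

result = join_url_parts("https://test.com/", "/endpoint/", "test.php")
print(result)  # https://test.com/endpoint/test.php
```

Note that this does plain string handling only: it does not resolve ".." segments or look at query strings.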

edited May 16 '21 at 16:51

answered May 16 '21 at 16:02

Great solution for this use case. Perhaps you could post it here too: https://stackoverflow.com/questions/1793261/how-to-join-components-of-a-path-when-you-are-constructing-a-url-in-python ? – Rivers May 16 '21 at 16:40
I don't think it adds anything significant over the existing answers there. – tripleee May 16 '21 at 16:47
