3

Recently I wanted to make a Python program which can crawl a website. I want to join the two components which should give the following output using urllib.parse.urljoin

https://test.com/endpoint + test.php =  https://test.com/endpoint/test.php

My code:

urllib.parse.urljoin('https://test.com/endpoint','test.php')

However, it is showing the following output:

https://test.com/test.php

Is there any way which can help me to get my desired output?

Rivers
  • 1,783
  • 1
  • 8
  • 27
Faiyaz Ahmad
  • 87
  • 1
  • 6

2 Answers2

1

That' because urllib.parse.urljoin is not made for this use case.

Example from the docs (https://docs.python.org/fr/3/library/urllib.parse.html#module-urllib.parse):

from urllib.parse import urljoin

new_url = urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
print(new_url)

Output:

http://www.cwi.nl/%7Eguido/FAQ.html

As written in the doc, urllib.parse.urljoin constructs

a full ("absolute") URL by combining a "base URL" (base) with another URL (url).

In your example, you give "https://test.com/endpoint" as first parameter, so urllib.parse.urljoin will consider that the "base url" is "https://test.com/", and it will add what you pass as a second parameter (that is "test.php"), that's why your output is "https://test.com/test.php".

I think that you best option is to use the joinurl function posted by @tripleee, because it will not produce results like "endpoint//test.php" or "endpointtest.php".

But you should not use os.path.join if your code has to be cross platform. On Windows, you will get a backslash instead of a slash ("https://test.com/endpoint\test.php").

Here is a code sample for testing purposes:

def joinurl(baseurl, path):
    return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])

url_base = "https://test.com/endpoint"
web_page_name = "/test.php"

desired_output = "https://test.com/endpoint/test.php"

assert(joinurl("https://test.com/endpoint", "test.php") == desired_output)
assert(joinurl("https://test.com/endpoint/", "test.php") == desired_output)
assert(joinurl("https://test.com/endpoint", "/test.php") == desired_output)
assert(joinurl("https://test.com/endpoint/", "/test.php") == desired_output)
Rivers
  • 1,783
  • 1
  • 8
  • 27
1

The purpose of urljoin is to replace the last part of the path in the base URL. If that's not what you want, probably use a different function. Regular string joining would work well here, perhaps with a provision for normalizing slashes.

def joinurl(baseurl, path):
    return '/'.join([baseurl.rstrip('/'), path.lstrip('/')])

This is rather similar to os.path.join; maybe consider using that instead. (Of course, on Windows, where the system path separator is not a slash, it will do the wrong thing for URLs.)

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • Great solution for this use case. Perhaps you could post it here too: https://stackoverflow.com/questions/1793261/how-to-join-components-of-a-path-when-you-are-constructing-a-url-in-python ? – Rivers May 16 '21 at 16:40
  • I don't think it adds anything significant over the existing answers there. – tripleee May 16 '21 at 16:47