31

I'm trying to determine the base of a URL, that is, everything besides the page and parameters. I tried using split, but is there a better way than splitting it up into pieces? Is there a way I can remove everything after the last '/'?

Given this: http://127.0.0.1/asdf/login.php

I would like: http://127.0.0.1/asdf/

Brendan
  • `re.sub(r"[^/]*(\?.*)?$", "", x)` – Amadan Feb 25 '16 at 01:17
  • This may be considered as cheating, but you could use `os.path.dirname()`. I'm not sure if that would work on Windows, but it works on Linux. – zondo Feb 25 '16 at 01:19
  • @zondo: I'm on Windows, and it definitely worked for me (on Py 3.5.1). – ShadowRanger Feb 25 '16 at 01:23
  • @ShadowRanger: No fair taking my idea. :( What do I care? I upvoted anyway. – zondo Feb 25 '16 at 01:26
  • @zondo: I actually posted my answer before your comment. :-) I have since edited to add some alternatives and clarification (though no edit history is shown, odd), but it was literally the first thing I tried. It does help that I happen to be on Windows, so I could quickly confirm that it worked on Windows too. – ShadowRanger Feb 25 '16 at 01:28
  • @ShadowRanger: I didn't notice that. You actually posted it one minute before I did. I think there is no edit history because you edited soon enough. It is the first thing I tried, too. It just looks so much like a file path, why couldn't `os.path.dirname()` do it, right? I feel sorry for you being on Windows... – zondo Feb 25 '16 at 01:30

8 Answers

38

The best way to do this is to use urllib.parse.

From the docs:

The module has been designed to match the Internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.

You'd want to do something like this using urlsplit and urlunsplit:

from urllib.parse import urlsplit, urlunsplit

split_url = urlsplit('http://127.0.0.1/asdf/login.php?q=abc#stackoverflow')

# You now have:
# split_url.scheme   "http"
# split_url.netloc   "127.0.0.1" 
# split_url.path     "/asdf/login.php"
# split_url.query    "q=abc"
# split_url.fragment "stackoverflow"

# Use all the path except everything after the last '/' 
clean_path = "".join(split_url.path.rpartition("/")[:-1])

# "/asdf/"

# urlunsplit joins a urlsplit tuple; swap in the trimmed path and drop the query and fragment
clean_url = urlunsplit(split_url._replace(path=clean_path, query="", fragment=""))

# "http://127.0.0.1/asdf/"


# A more advanced example 
advanced_split_url = urlsplit('http://foo:bar@127.0.0.1:5000/asdf/login.php?q=abc#stackoverflow')

# You now have *in addition* to the above:
# advanced_split_url.username   "foo"
# advanced_split_url.password   "bar"
# advanced_split_url.hostname   "127.0.0.1"
# advanced_split_url.port       5000  (an int, not a string)
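
Putting the pieces above together, here is a minimal sketch of a reusable helper. The name `url_base` is hypothetical (it is not part of urllib), and dropping the query string and fragment is an assumption based on the question asking for everything besides the page and parameters.

```python
from urllib.parse import urlsplit, urlunsplit


def url_base(url):
    """Return the URL up to and including the last '/' of its path.

    Hypothetical helper: the query string and fragment are dropped (an assumption).
    """
    parts = urlsplit(url)
    # rpartition yields (head, sep, tail); joining head + sep keeps the trailing '/'
    base_path = "".join(parts.path.rpartition("/")[:-1])
    return urlunsplit((parts.scheme, parts.netloc, base_path, "", ""))


print(url_base("http://127.0.0.1/asdf/login.php?q=abc#stackoverflow"))
# http://127.0.0.1/asdf/
```

Because only the path component is trimmed, credentials and ports in the netloc should pass through unchanged.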
dalanmiller
  • 2
    Your split and rejoin should probably use `'/'.join`, or you'll strip all the slashes. Another more clever approach might be `"".join(split_url.path.rpartition('/')[:-1])`, which performs only one split, and if no slashes exist, effectively becomes a noop. – ShadowRanger Feb 25 '16 at 01:36
  • Awesome tip ShadowRanger, I've always wondered if you could do that but never thought to look. Congrats on the answer ;) – dalanmiller Feb 25 '16 at 17:16
  • This should be the answer, on topic and thorough explanation! – pferrel May 17 '19 at 18:27
19

Well, for one, you could just use os.path.dirname:

>>> import os
>>> os.path.dirname('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1/asdf'

It's not explicitly for URLs, but it happens to work on them (even on Windows). It just doesn't leave the trailing slash, which you can add back yourself.

You may also want to look at urllib.parse.urlparse for more fine-grained parsing; if the URL has a query string or hash involved, you'd want to parse it into pieces, trim the path component returned by parsing, then recombine, so the path is trimmed without losing query and hash info.
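
As a rough sketch of that parse, trim, recombine idea: the helper name `trim_last_segment` is hypothetical, and it assumes you want the query string and fragment kept exactly as they were. It uses `posixpath.dirname` rather than `os.path.dirname` so that the split is always on '/', whatever the operating system.

```python
import posixpath
from urllib.parse import urlparse


def trim_last_segment(url):
    """Hypothetical helper: trim the last path segment, keep query and fragment."""
    parsed = urlparse(url)
    # posixpath.dirname always treats '/' as the separator
    trimmed = posixpath.dirname(parsed.path)
    # dirname drops the trailing slash, so add it back
    if not trimmed.endswith("/"):
        trimmed += "/"
    return parsed._replace(path=trimmed).geturl()


print(trim_last_segment("http://127.0.0.1/asdf/login.php?q=abc#frag"))
# http://127.0.0.1/asdf/?q=abc#frag
```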

Lastly, if you want to just split off the component after the last slash, you can do an rsplit with a maxsplit of 1, and keep the first component:

>>> 'http://127.0.0.1/asdf/login.php'.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'
ShadowRanger
  • 1
    Do you know why `os.path.dirname` also handles URLs so well? – dalanmiller Feb 25 '16 at 01:31
  • 3
    @dalanmiller: Because they use the same separators. Even on Windows, a forward slash is a legal path separator (it's just that Windows uses backslashes by preference), so path manipulation APIs are designed to handle forward slashes everywhere, and URLs use them in the same way. – ShadowRanger Feb 25 '16 at 01:33
  • 1
    Given that the question has to do with URLs, I'd say `urllib.parse` should be the recommended portion of stdlib to use. – dalanmiller Apr 01 '19 at 06:15
  • Indeed, that's a strong point: URLs and the support for them will likely evolve together, whereas support for new URL peculiarities (if any ever appear) may never be added to file-handling functions, since it could be unnecessary complexity there. Though I personally still like the slightly blunt solutions and sorting out many things with a couple of approaches ;) – brezniczky Aug 10 '19 at 13:59
  • This does not work when I just tested it on Mac OSX `os.path.dirname('http://example.com')` returns `'http:'` – robmsmt Apr 15 '23 at 20:15
  • @robmsmt: It works just fine for the OP's example of stripping off the final component to keep the rest of the path. What are you even expecting that to do? It has nothing to do with what the OP requested. – ShadowRanger Apr 16 '23 at 01:02
7

The shortest solution for Python 3 uses urljoin from the urllib library (I don't know whether it is the fastest):

from urllib.parse import urljoin

base_url = urljoin('http://127.0.0.1/asdf/login.php', '.')
# output: http://127.0.0.1/asdf/

Keep in mind that urllib resolves relative references the same way HTML links do. This means a URL ending with '/' is treated differently from one without it, as explained here https://stackoverflow.com/a/1793282/7750840/:

base_url = urljoin('http://127.0.0.1/asdf/', '.')
# output: http://127.0.0.1/asdf/

base_url = urljoin('http://127.0.0.1/asdf', '.')
# output: http://127.0.0.1/
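
A quick sketch of how this behaves when the URL carries a query string or fragment. The example URL with `?q=abc#top` is an illustration and not from the original answer. Resolving '.' against it should discard both parts, because the result takes its query and fragment from the relative reference, which has none.

```python
from urllib.parse import urljoin

# illustrative URL with a query string and a fragment
url = 'http://127.0.0.1/asdf/login.php?q=abc#top'

base_url = urljoin(url, '.')
print(base_url)
# output: http://127.0.0.1/asdf/
```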

Here is a link to a urllib tutorial for Python: https://pythonprogramming.net/urllib-tutorial-python-3/

bukas
7

I agree that the best way to do this is with urllib.parse.

Specifically, you can decompose the url with urllib.parse.urlparse and then replace every attribute other than scheme and netloc with an empty string. If you want to keep the path attribute (as in your question), you can do so with an extra string parsing step. Example function below:

import urllib.parse
def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    path   = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    parsed = parsed._replace(path=path)
    parsed = parsed._replace(params='')
    parsed = parsed._replace(query='')
    parsed = parsed._replace(fragment='')
    return parsed.geturl()

Examples:

>>> base_url('http://127.0.0.1/asdf/login.php', with_path=True)
'http://127.0.0.1/asdf'
>>> base_url('http://127.0.0.1/asdf/login.php')
'http://127.0.0.1'
rodms
1

No need to use a regex, you can just use rsplit():

>>> url = 'http://127.0.0.1/asdf/login.php'
>>> url.rsplit('/', 1)[0]
'http://127.0.0.1/asdf'
pp_
1

When you use urlsplit, it returns a SplitResult object:

from urllib.parse import urlsplit
split_url = urlsplit('http://127.0.0.1/asdf/login.php')
print(split_url)

>>> SplitResult(scheme='http', netloc='127.0.0.1', path='/asdf/login.php', query='', fragment='')

You can make your own SplitResult() object and pass it through urlunsplit. This code should work for multiple url splits, regardless of their length, as long as you know what the last path element you want is.

from urllib.parse import urlsplit, urlunsplit, SplitResult

# splitting url:
split_url = urlsplit('http://127.0.0.1/asdf/login.php')

# editing the variables you want to change (in this case, path):    
last_element = 'asdf'   # this can be any element in the path.
path_array = split_url.path.split('/')

# print(path_array)
# >>> ['', 'asdf', 'login.php']

path_array.remove('') 
ind = path_array.index(last_element) 
new_path = '/' + '/'.join(path_array[:ind+1]) + '/'

# making SplitResult() object with edited data:
new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc, path=new_path, query='', fragment='')

# unsplitting:
base_url = urlunsplit(new_url)
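
The same steps can be wrapped in a function. This is only a sketch: the name `trim_to_element` is hypothetical, and letting a missing element raise `ValueError` is an assumption about the desired behaviour. It also filters out every empty path piece instead of removing just the first one, so doubled or trailing slashes do not leave stray entries.

```python
from urllib.parse import urlsplit, urlunsplit, SplitResult


def trim_to_element(url, last_element):
    """Hypothetical helper: keep the path up to and including last_element."""
    split_url = urlsplit(url)
    # drop empty strings produced by leading, trailing or doubled slashes
    path_array = [part for part in split_url.path.split('/') if part]
    ind = path_array.index(last_element)  # raises ValueError if not present
    new_path = '/' + '/'.join(path_array[:ind + 1]) + '/'
    new_url = SplitResult(scheme=split_url.scheme, netloc=split_url.netloc,
                          path=new_path, query='', fragment='')
    return urlunsplit(new_url)


print(trim_to_element('http://127.0.0.1/asdf/login.php', 'asdf'))
# http://127.0.0.1/asdf/
```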
Sarah Kay
0

Get the right-most occurrence of a slash, then slice the original string up through that position. The +1 keeps that final slash at the end.

link = "http://127.0.0.1/asdf/login.php"
link[:link.rfind('/')+1]
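
A small variant, as a sketch: `str.rindex` does the same search as `str.rfind` but raises `ValueError` when there is no slash, instead of returning -1 and silently producing an empty string. The function name `strip_after_last_slash` is hypothetical.

```python
def strip_after_last_slash(link):
    # str.rindex raises ValueError if '/' is absent; str.rfind would return -1
    return link[:link.rindex('/') + 1]


print(strip_after_last_slash("http://127.0.0.1/asdf/login.php"))
# http://127.0.0.1/asdf/

try:
    strip_after_last_slash("no-slashes-here")
except ValueError:
    print("no '/' found")
```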
Prune
  • 3
    Probably better to use `rindex` for this; otherwise, if you have a string without slashes, you'll silently return the empty string (because `rfind` will return -1, you'll add 1, and slice from 0 to 0). At least with `rindex`, you'll get an exception rather than continuing on until having an empty string causes everything to blow up. – ShadowRanger Feb 25 '16 at 01:30
0

If you use Python 3, you can use urlparse and urlunparse.

>>> from urllib.parse import urlparse, urlunparse
>>> url = "http://127.0.0.1/asdf/login.php"
>>> result = urlparse(url)
>>> new = list(result)
>>> new[2] = new[2].replace("login.php", "")
>>> urlunparse(new)
'http://127.0.0.1/asdf/'
bzd111