92

I have the following URL:

url = http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg

I would like to extract the file name in this URL: 09-09-201315-47-571378756077.jpg

Once I get this file name, I'm going to save it with this name to the Desktop.

filename = **extracted file name from the url**     
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))

After this, I'm going to resize the photo. Once that is done, I'm going to save the resized version and append "_small" to the end of the filename.

downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))               
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))

From this, what I am trying to achieve is to get two files, the original photo with the original name, then the resized photo with the modified name. Like so:

09-09-201315-47-571378756077.jpg

rename to:

09-09-201315-47-571378756077_small.jpg

How can I go about doing this?

funnydman
deadlock

12 Answers

231

You can use urllib.parse.urlparse with os.path.basename:

import os
from urllib.parse import urlparse

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

Your URL might contain percent-encoded characters like %20 for a space or %E7%89%B9%E8%89%B2 for "特色". If that's the case, you'll need to unquote (or unquote_plus) them. You can also use pathlib.Path().name instead of os.path.basename, which makes it easy to add a suffix to the name (as asked in the original question):

from pathlib import Path
from urllib.parse import urlparse, unquote

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"
url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)        # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg

new_file = file_path.with_stem(file_path.stem + "_small")  # Path.with_stem requires Python 3.9+
print(new_file)         # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg

Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
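
Putting the pieces together for the original question, here is a minimal sketch that returns both the original file name and the "_small" variant. The function name original_and_small_names and its suffix parameter are made up for illustration, and posixpath is used instead of os.path so the result does not depend on the operating system:

```python
import posixpath
from urllib.parse import urlparse, unquote


def original_and_small_names(url, suffix="_small"):
    """Return (filename, filename_with_suffix) for the last path segment of url.

    Hypothetical helper: parse first, take the last segment, then unquote it.
    Note: a URL whose path ends in "/" yields an empty filename.
    """
    path = urlparse(url).path                 # drops the query string and fragment
    filename = unquote(posixpath.basename(path))
    stem, ext = posixpath.splitext(filename)  # splits only at the last dot
    return filename, f"{stem}{suffix}{ext}"


url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg?size=1000px"
original, small = original_and_small_names(url)
print(original)  # 09-09-201315-47-571378756077.jpg
print(small)     # 09-09-201315-47-571378756077_small.jpg
```

The two names can then be joined onto whatever download directory you use.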

Jean-Francois T.
Ofir Israel
  • 7
    caution: os.path in windows might expect "\" – vatsa Feb 16 '18 at 21:10
  • 13
    You don't even need urlparse. `os.path.basename(url)` works perfect. – elky Jun 03 '18 at 11:23
  • 31
    @elky One does need urlparse. Only with using urlparse an url with query string like `http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg?size=1000px` will be extracted to a filename `09-09-201315-47-571378756077.jpg`. If you only use `os.path.basename(url)` the extracted filename will include the query-string: `09-09-201315-47-571378756077.jpg?size=1000px` . This is usually not the desired solution. – asmaier Dec 27 '18 at 11:10
  • 1
    Because the separator on Windows is different, I have confirmed that this solution works on Windows. – Luke Apr 01 '21 at 01:02
  • @Jean-Francois lets not add too much to the answer and I think you should `urlparse` the URL as it is before you `unquote`, because `unquote` doesn't expect a URL, it expects just `/the/path/part` of the url. – Boris Verkhovskiy Apr 25 '22 at 03:28
  • @BorisV Good point :) Although `unquote` before ` urlparse` does work, actually and the code looks slightly neater. – Jean-Francois T. Apr 25 '22 at 03:47
  • 2
    @Jean-FrancoisT. it doesn't work, you just didn't think of the edge cases, like when you have a percent encoded `#`. Try `Path(unquote(urlparse('http://example.com/my%20%23superawesome%20picture.jpg').path)).name` vs `Path(urlparse(unquote('http://example.com/my%20%23superawesome%20picture.jpg')).path).name`. It's just never a good idea to blindly modify something you intend to parse before parsing it. – Boris Verkhovskiy Apr 25 '22 at 03:56
  • @BorisV Good point. Corrected – Jean-Francois T. Apr 25 '22 at 05:56
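
The ordering issue discussed in these comments can be reproduced with the URL from the comment above. This is only a small demonstration; the variable names are made up, and PurePosixPath is used so that the result is the same on every platform:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote

url = "http://example.com/my%20%23superawesome%20picture.jpg"

# Parse first, then unquote only the path: the encoded "#" stays in the name.
parse_then_unquote = PurePosixPath(unquote(urlparse(url).path)).name
print(repr(parse_then_unquote))  # 'my #superawesome picture.jpg'

# Unquote first, then parse: the decoded "#" now starts a fragment,
# so everything after it is cut off from the path.
unquote_then_parse = PurePosixPath(urlparse(unquote(url)).path).name
print(repr(unquote_then_parse))  # 'my '
```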
35

os.path.basename(url)

Why try harder?

In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'

In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'

In [3]: os.path.basename("https://example.com/")
Out[3]: ''

In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'
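
One case the session above does not cover is a URL with a query string or fragment. A short sketch of the difference, using a made-up example URL and posixpath so the behaviour is the same on any OS:

```python
import posixpath
from urllib.parse import urlparse

url = "https://example.com/file.html?size=1000px#top"

# basename on the raw URL keeps everything after the last "/".
naive = posixpath.basename(url)
print(naive)   # file.html?size=1000px#top

# basename on the parsed path sees only the path component.
parsed = posixpath.basename(urlparse(url).path)
print(parsed)  # file.html
```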

Note 2020-12-20

Nobody has thus far provided a complete solution.

A URL can contain a ?[query-string] and/or a #[fragment Identifier] (but only in that order: ref)

In [1]: from os import path

In [2]: def get_filename(url):
   ...:     fragment_removed = url.split("#")[0]  # keep to left of first #
   ...:     query_string_removed = fragment_removed.split("?")[0]
   ...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
   ...:     if scheme_removed.find("/") == -1:
   ...:         return ""
   ...:     return path.basename(scheme_removed)
   ...:

In [3]: get_filename("a.com/b")
Out[3]: 'b'

In [4]: get_filename("a.com/")
Out[4]: ''

In [5]: get_filename("https://a.com/")
Out[5]: ''

In [6]: get_filename("https://a.com/b")
Out[6]: 'b'

In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
P i
  • 4
    @Pi "Nobody has thus far provided a complete solution" the accepted answer is a "complete solution" that throws out the `#` and `?` parts of the URL which it does using the URL parsing built into Python (which might handle an edge case you didn't think of). – Boris Verkhovskiy Jan 24 '21 at 06:18
  • I prefer this answer to the one above that uses `urllib.parse.urlparse` with `os.path.basename` by @Boris, because this answer only imports the `os` package, not urllib which is mostly duplicated by Requests and superseded by urllib2. One less dependency to become obsolete and causing future code maintenance. – Rich Lysakowski PhD Mar 25 '21 at 03:37
  • 1
    @RichLysakowskiPhD there is no such thing as `urllib2` on Python 3 and `requests` [uses `urllib.parse` under the hood](https://github.com/psf/requests/search?q=urllib). How is implementing URL parsing yourself a smaller maintenance burden than an import? – Boris Verkhovskiy Mar 25 '21 at 03:55
  • @Boris you are right. urllib2 does not exist in Python 3, so urllib built into Python or requests is the way to go. Thank you for clarifying with a source url : https://github.com/psf/requests/blob/4f6c0187150af09d085c03096504934eb91c7a9e/requests/compat.py – Rich Lysakowski PhD Mar 25 '21 at 04:43
  • I find the topmost solution more clean. I guess this is just an old post? – GuiTaek Oct 06 '21 at 12:56
  • @BorisV edge cases like: `"https://toto.com/dir/my%20file%20has%20spaces.txt"` which contain spaces... This would be handled by `unquote` in `urllib.parse`. – Jean-Francois T. Apr 25 '22 at 03:01
  • Would not that be easier with regex instead of multiple `split` / `find`? – Jean-Francois T. Apr 25 '22 at 03:11
22
filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")

It may be safer to replace ".jpg" rather than "." in the second line, since a . can also appear elsewhere in the filename.
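
To illustrate the dot problem, here is a small sketch comparing replace(".", ...) with os.path.splitext, which only splits at the last dot. The helper name add_small_suffix and the sample file name are assumptions for illustration:

```python
import os


def add_small_suffix(filename, suffix="_small"):
    """Insert suffix before the extension (hypothetical helper)."""
    stem, ext = os.path.splitext(filename)  # ext is "" if there is no extension
    return stem + suffix + ext


name = "report_01.01.2022.pdf"
print(name.replace(".", "_small."))  # report_01_small.01_small.2022_small.pdf
print(add_small_suffix(name))        # report_01.01.2022_small.pdf
```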

RickyA
17

You could just split the url by "/" and retrieve the last member of the list:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1] 
#09-09-201315-47-571378756077.jpg

Then use replace to change the ending:

small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg
funnydman
Bryan
11

With Python 3 (from 3.4 upwards) you can abuse the pathlib library in the following way:

from pathlib import Path

p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'

print(p.stem)
# >>> 'somefile'

print(p.suffix)
# >>> '.html'

print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'

❗️ WARNING

The pathlib module is NOT meant for parsing URLs; it is designed to work with filesystem paths only. Don't use it in production code! It's a quick and dirty hack for non-critical code. The fact that pathlib also works with URLs can be considered an accident that might be fixed in future releases. The code is only provided as an example of what you can, but probably should not, do. If you need to parse URLs in a canonical way, prefer urllib.parse or an alternative. Or, if you assume that the portion after the domain and before the parameters, query, and fragment is a POSIX path, you can extract just the path component using urllib.parse.urlparse and then use pathlib.Path to manipulate it.
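
As a concrete illustration of the warning, this short sketch shows two ways a URL gets mangled when treated as a path. PurePosixPath is used here (an assumption, to keep the output identical across platforms) and the URL is a made-up example:

```python
from pathlib import PurePosixPath

p = PurePosixPath("http://example.com/somefile.html?some-querystring#some-id")

# The "//" after the scheme is collapsed, so this is no longer a valid URL.
print(str(p))    # http:/example.com/somefile.html?some-querystring#some-id

# The query string and fragment leak into the name and the suffix.
print(p.name)    # somefile.html?some-querystring#some-id
print(p.suffix)  # .html?some-querystring#some-id
```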

ccpizza
  • 2
    This breaks with URLs with stuff after the path. `Path('http://example.com/somefile.html?some-querystring#some-id').name` will return `'somefile.html?some-querystring#some-id'` – Boris Verkhovskiy Apr 25 '22 at 03:36
9

Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:

from urllib.parse import urlparse
from pathlib import Path


url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path             # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'
Community
Boris Verkhovskiy
  • Seems like this might not work if you were running on Windows, right? – Stephen Aug 30 '20 at 21:04
  • @Stephen it will work because `pathlib` [uses forward slashes](https://docs.python.org/3/library/pathlib.html#pathlib.WindowsPath) when defining paths, even on Windows. However note that `pathlib` converts `"/"` to `"\"` on Windows when you convert `Path` objects to `str` or `bytes`, so if you're modifying the above code to do something different, like getting the filename *and* the part before it (as in `path/a_filename.jpg`) but you want to keep forward slashes as forward slashes, you can do `str(PurePosixPath(urlparse(url).path))` instead of `str(Path(urlparse(url).path))`. – Boris Verkhovskiy Jan 24 '21 at 06:32
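
A minimal sketch of the PurePosixPath variant mentioned in the comment above, which keeps forward slashes regardless of platform. The variable names are made up for illustration:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true#and-an-anchor"
p = PurePosixPath(urlparse(url).path)

print(p.name)          # a_filename.jpg
print(str(p.parent))   # /some/long/path

# Filename plus the directory directly above it, still with forward slashes.
last_two = str(PurePosixPath(*p.parts[-2:]))
print(last_two)        # path/a_filename.jpg
```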
2

Sometimes there is a query string:

filename = url.split("/")[-1].split("?")[0] 
new_filename = filename.replace(".jpg", "_small.jpg")
user2821
1

A simple version using the os module:

import os

def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)

Examples:

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'
Jossef Harush Kadouri
1

Sometimes the URL you have redirects elsewhere (that was the case for me). In that case you have to resolve the redirects first:

import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
response = requests.head(url, allow_redirects=True)  # HEAD requests do not follow redirects by default
url = response.url  # final URL after any redirects

Then you can continue with what is currently the best answer (Ofir's):

import os
from urllib.parse import urlparse


a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

This doesn't work with the URL from the question, however, as that page is no longer available.

GuiTaek
0

The question "Python split url to find image name and extension" shows how to extract the image name. To append to the name:

imageName =  '09-09-201315-47-571378756077'

new_name = '{0}_small.jpg'.format(imageName) 
Community
Moj
0

I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for that; use a dedicated library such as urllib instead.

This is the most robust version I could come up with. It handles query parameters as well as fragments:

from urllib.parse import urlparse, ParseResult

def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path

    filename = path[path.rfind('/') + 1:]

    if not filename:
        return

    file, extension = filename.rsplit('.', 1)

    new_path = parsed_url.path.replace(filename, f"{file}_small.{extension}")
    parsed_url = ParseResult(**{**parsed_url._asdict(), 'path': new_path})

    return parsed_url.geturl()

Example:

assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'
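
Two inputs are not covered by the examples above: a file name without an extension (where rsplit('.', 1) cannot be unpacked into two values) and a path in which the file name also appears as a directory name. A possible variant that handles both is sketched below; the function name add_suffix_to_url_filename is made up, and this is only one way to do it, not a replacement for the code above:

```python
import posixpath
from urllib.parse import urlparse


def add_suffix_to_url_filename(url, suffix="_small"):
    """Return url with suffix inserted before the file extension, or None
    if the path has no file name (hypothetical helper)."""
    parsed = urlparse(url)
    directory, filename = posixpath.split(parsed.path)
    if not filename:
        return None
    stem, ext = posixpath.splitext(filename)  # ext is "" when there is no dot
    new_path = posixpath.join(directory, f"{stem}{suffix}{ext}")
    # _replace is the namedtuple method; query and fragment are kept as-is.
    return parsed._replace(path=new_path).geturl()


print(add_suffix_to_url_filename("https://example.com/path/to/report.pdf?param=1#test"))
# https://example.com/path/to/report_small.pdf?param=1#test
print(add_suffix_to_url_filename("https://example.com/download"))
# https://example.com/download_small
```

Only the last path segment is touched, so a directory that happens to share the file's name is left alone.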
funnydman
-1

We can extract the filename from a URL by using the ntpath module.

import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# 09-09-201315-47-571378756077  .jpg


print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg