How to extract a filename from a URL and append a word to it?

Question

I have the following URL:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"

I would like to extract the file name in this URL: 09-09-201315-47-571378756077.jpg

Once I get this file name, I'm going to save it with this name to the Desktop.

filename = **extracted file name from the url**     
download_photo = urllib.urlretrieve(url, "/home/ubuntu/Desktop/%s.jpg" % (filename))

After this, I'm going to resize the photo. Once that is done, I'm going to save the resized version and append the word "_small" to the end of the filename.

downloadedphoto = Image.open("/home/ubuntu/Desktop/%s.jpg" % (filename))
resize_downloadedphoto = downloadedphoto.resize((300, 300), Image.ANTIALIAS)
resize_downloadedphoto.save("/home/ubuntu/Desktop/%s.jpg" % (filename + "_small"))

From this, what I am trying to achieve is to get two files, the original photo with the original name, then the resized photo with the modified name. Like so:

09-09-201315-47-571378756077.jpg

rename to:

09-09-201315-47-571378756077_small.jpg

How can I go about doing this?

score 231 · Accepted Answer · edited Apr 25 '22 at 06:26

You can use urllib.parse.urlparse with os.path.basename:

import os
from urllib.parse import urlparse

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

Your URL might contain percent-encoded characters, like %20 for a space or %E7%89%B9%E8%89%B2 for "特色". If so, you'll need to unquote (or unquote_plus) them after parsing. You can also use pathlib.Path().name instead of os.path.basename, which makes it easy to add a suffix to the name, as the question asks (Path.with_stem, used below, requires Python 3.9 or later):

from pathlib import Path
from urllib.parse import urlparse, unquote

url = "http://photographs.500px.com/kyle/09-09-2013%20-%2015-47-571378756077.jpg"

url_parsed = urlparse(url)
print(unquote(url_parsed.path))  # Output: /kyle/09-09-2013 - 15-47-571378756077.jpg
file_path = Path("/home/ubuntu/Desktop/") / unquote(Path(url_parsed.path).name)
print(file_path)        # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077.jpg

new_file = file_path.with_stem(file_path.stem + "_small")
print(new_file)         # Output: /home/ubuntu/Desktop/09-09-2013 - 15-47-571378756077_small.jpg

Also, an alternative is to use unquote(urlparse(url).path.split("/")[-1]).
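
Putting the pieces together for the workflow in the question, a minimal sketch might look like the following. The helper names local_paths and download_and_resize are made up for illustration. with_name is used instead of with_stem so the path handling also runs on Python versions before 3.9. The download-and-resize part assumes Pillow is installed, and Image.LANCZOS stands in for the question's Image.ANTIALIAS, which newer Pillow releases no longer provide.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse, unquote
from urllib.request import urlretrieve


def local_paths(url, directory):
    """Return (original_path, small_path) for the file named in `url`."""
    # Parse first, then unquote only the last segment of the path.
    name = unquote(PurePosixPath(urlparse(url).path).name)
    original = Path(directory) / name
    # Insert "_small" between the stem and the extension.
    small = original.with_name(original.stem + "_small" + original.suffix)
    return original, small


def download_and_resize(url, directory, size=(300, 300)):
    """Download `url` into `directory` and save a resized copy next to it."""
    from PIL import Image  # assumption: Pillow is installed

    original, small = local_paths(url, directory)
    urlretrieve(url, str(original))
    with Image.open(original) as photo:
        photo.resize(size, Image.LANCZOS).save(small)
    return original, small


url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
original, small = local_paths(url, "/home/ubuntu/Desktop")
print(original.name)  # 09-09-201315-47-571378756077.jpg
print(small.name)     # 09-09-201315-47-571378756077_small.jpg
```

Only local_paths is exercised here; download_and_resize needs network access and is shown just to indicate where the two paths would be used.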

edited Apr 25 '22 at 06:26

Jean-Francois T.

answered Sep 10 '13 at 19:41

Ofir Israel

7

caution: os.path in windows might expect "\" – vatsa Feb 16 '18 at 21:10
13

You don't even need urlparse. `os.path.basename(url)` works perfect. – elky Jun 03 '18 at 11:23
31

@elky One does need urlparse. Only with using urlparse an url with query string like `http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg?size=1000px` will be extracted to a filename `09-09-201315-47-571378756077.jpg`. If you only use `os.path.basename(url)` the extracted filename will include the query-string: `09-09-201315-47-571378756077.jpg?size=1000px` . This is usually not the desired solution. – asmaier Dec 27 '18 at 11:10
1

Because the separator on Windows is different, I have confirmed that this solution works on Windows. – Luke Apr 01 '21 at 01:02
@Jean-Francois lets not add too much to the answer and I think you should `urlparse` the URL as it is before you `unquote`, because `unquote` doesn't expect a URL, it expects just `/the/path/part` of the url. – Boris Verkhovskiy Apr 25 '22 at 03:28
@BorisV Good point :) Although `unquote` before ` urlparse` does work, actually and the code looks slightly neater. – Jean-Francois T. Apr 25 '22 at 03:47
2

@Jean-FrancoisT. it doesn't work, you just didn't think of the edge cases, like when you have a percent encoded `#`. Try `Path(unquote(urlparse('http://example.com/my%20%23superawesome%20picture.jpg').path)).name` vs `Path(urlparse(unquote('http://example.com/my%20%23superawesome%20picture.jpg')).path).name`. It's just never a good idea to blindly modify something you intend to parse before parsing it. – Boris Verkhovskiy Apr 25 '22 at 03:56
@BorisV Good point. Corrected – Jean-Francois T. Apr 25 '22 at 05:56
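
The ordering issue raised in the comments above can be checked directly. This is a small sketch using the URL from that comment; PurePosixPath is used here (rather than Path) only so the result does not depend on the operating system.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse, unquote

url = "http://example.com/my%20%23superawesome%20picture.jpg"

# Parse first, then decode only the path component.
parse_then_unquote = PurePosixPath(unquote(urlparse(url).path)).name

# Decode first: the literal "#" now starts a fragment and cuts the path short.
unquote_then_parse = PurePosixPath(urlparse(unquote(url)).path).name

print(repr(parse_then_unquote))               # 'my #superawesome picture.jpg'
print(repr(unquote_then_parse))               # a truncated name without the "#"
print(repr(urlparse(unquote(url)).fragment))  # 'superawesome picture.jpg'
```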

P i · Answer 2 · 2020-12-20T03:43:57.617

35

os.path.basename(url)

Why try harder?

In [1]: os.path.basename("https://example.com/file.html")
Out[1]: 'file.html'

In [2]: os.path.basename("https://example.com/file")
Out[2]: 'file'

In [3]: os.path.basename("https://example.com/")
Out[3]: ''

In [4]: os.path.basename("https://example.com")
Out[4]: 'example.com'

Note 2020-12-20

Nobody has thus far provided a complete solution.

A URL can contain a ?[query-string] and/or a #[fragment identifier] (but only in that order).

In [1]: from os import path

In [2]: def get_filename(url):
   ...:     fragment_removed = url.split("#")[0]  # keep to left of first #
   ...:     query_string_removed = fragment_removed.split("?")[0]
   ...:     scheme_removed = query_string_removed.split("://")[-1].split(":")[-1]
   ...:     if scheme_removed.find("/") == -1:
   ...:         return ""
   ...:     return path.basename(scheme_removed)
   ...:

In [3]: get_filename("a.com/b")
Out[3]: 'b'

In [4]: get_filename("a.com/")
Out[4]: ''

In [5]: get_filename("https://a.com/")
Out[5]: ''

In [6]: get_filename("https://a.com/b")
Out[6]: 'b'

In [7]: get_filename("https://a.com/b?c=d#e")
Out[7]: 'b'
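
For comparison, the same inputs can be run through the standard-library parser. This is only a sketch: get_filename_urlparse is a hypothetical name, and posixpath.basename is used so the result does not depend on the operating system's path separator.

```python
import posixpath
from urllib.parse import urlparse


def get_filename_urlparse(url):
    # urlparse separates the query string and fragment from the path.
    return posixpath.basename(urlparse(url).path)


urls = ["a.com/b", "a.com/", "https://a.com/", "https://a.com/b",
        "https://a.com/b?c=d#e", "https://a.com"]
for url in urls:
    print(repr(url), "->", repr(get_filename_urlparse(url)))
```

For these six inputs it returns 'b' where the path ends in b and an empty string otherwise.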

edited Dec 20 '20 at 03:43

answered Aug 07 '18 at 11:49

P i

4

@Pi "Nobody has thus far provided a complete solution" the accepted answer is a "complete solution" that throws out the `#` and `?` parts of the URL which it does using the URL parsing built into Python (which might handle an edge case you didn't think of). – Boris Verkhovskiy Jan 24 '21 at 06:18
I prefer this answer to the one above that uses `urllib.parse.urlparse` with `os.path.basename` by @Boris, because this answer only imports the `os` package, not urllib which is mostly duplicated by Requests and superseded by urllib2. One less dependency to become obsolete and causing future code maintenance. – Rich Lysakowski PhD Mar 25 '21 at 03:37
1

@RichLysakowskiPhD there is no such thing as `urllib2` on Python 3 and `requests` [uses `urllib.parse` under the hood](https://github.com/psf/requests/search?q=urllib). How is implementing URL parsing yourself a smaller maintenance burden than an import? – Boris Verkhovskiy Mar 25 '21 at 03:55
@Boris you are right. urllib2 does not exist in Python 3, so urllib built into Python or requests is the way to go. Thank you for clarifying with a source url : https://github.com/psf/requests/blob/4f6c0187150af09d085c03096504934eb91c7a9e/requests/compat.py – Rich Lysakowski PhD Mar 25 '21 at 04:43
I find the topmost solution more clean. I guess this is just an old post? – GuiTaek Oct 06 '21 at 12:56
@BorisV edge cases like: `"https://toto.com/dir/my%20file%20has%20spaces.txt"` which contain spaces... This would be handled by `unquote` in `urllib.parse`. – Jean-Francois T. Apr 25 '22 at 03:01
Would not that be easier with regex instead of multiple `split` / `find`? – Jean-Francois T. Apr 25 '22 at 03:11

score 22 · Answer 3 · answered Sep 10 '13 at 19:39

filename = url[url.rfind("/")+1:]
filename_small = filename.replace(".", "_small.")

It may be safer to replace ".jpg" rather than "." in the second line, since a "." can also appear elsewhere in the filename.
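
One way to avoid depending on a particular extension is to split at the last dot only. A minimal sketch, where add_suffix is a made-up helper name:

```python
import os


def add_suffix(filename, suffix="_small"):
    # os.path.splitext splits at the last dot only, so dots earlier
    # in the name are left untouched.
    root, ext = os.path.splitext(filename)
    return root + suffix + ext


print(add_suffix("09-09-201315-47-571378756077.jpg"))  # 09-09-201315-47-571378756077_small.jpg
print(add_suffix("image27.08.2016.jpg"))               # image27.08.2016_small.jpg
```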

answered Sep 10 '13 at 19:39

RickyA

6

Just as a note, `/path/to/image27.08.2016.jpg` would become `image27_small.08_small.2016_small.jpg` – luckydonald Aug 27 '16 at 14:08
Yeah, it's not working for all cases, so it shouldn't be considered the correct answer – Shadab K Apr 17 '20 at 17:57

score 17 · Answer 4 · edited Aug 21 '22 at 13:17

You could just split the url by "/" and retrieve the last member of the list:

url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
filename = url.split("/")[-1] 
#09-09-201315-47-571378756077.jpg

Then use replace to change the ending:

small_jpg = filename.replace(".jpg", "_small.jpg")
#09-09-201315-47-571378756077_small.jpg

edited Aug 21 '22 at 13:17

funnydman

answered Sep 10 '13 at 19:52

Bryan

3

Easy to read and does not use any external package, best answer. – Horai Nuri Aug 16 '18 at 08:11
8

For websites like github that add args to the url like '?raw=true', this will not work. – ahmedhosny Oct 31 '18 at 16:08

ccpizza · Answer 5 · 2023-05-08T14:03:48.017

With python3 (from 3.4 upwards) you can abuse the pathlib library in the following way:

from pathlib import Path

p = Path('http://example.com/somefile.html')
print(p.name)
# >>> 'somefile.html'

print(p.stem)
# >>> 'somefile'

print(p.suffix)
# >>> '.html'

print(f'{p.stem}-spamspam{p.suffix}')
# >>> 'somefile-spamspam.html'

❗️ WARNING

The pathlib module is NOT meant for parsing URLs; it is designed to work with filesystem paths only. Don't use it in production code! It's a quick and dirty hack for non-critical code. The fact that pathlib also works with URLs can be considered an accident that might be fixed in future releases. The code is only provided as an example of what you can, but probably should not, do. If you need to parse URLs in a canonical way, prefer urllib.parse or alternatives. Or, if you assume that the portion after the domain and before the parameters, query and hash is a POSIX-style path, you can extract just the path component using urllib.parse.urlparse and then use pathlib.Path to manipulate it.

This breaks with URLs with stuff after the path. `Path('http://example.com/somefile.html?some-querystring#some-id').name` will return `'somefile.html?some-querystring#some-id'` — Boris Verkhovskiy, Apr 25 '22 at 03:36

score 9 · Answer 6 · edited Oct 07 '21 at 08:14

Use urllib.parse.urlparse to get just the path part of the URL, and then use pathlib.Path on that path to get the filename:

from urllib.parse import urlparse
from pathlib import Path


url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true&some_more=true#and-an-anchor"
a = urlparse(url)
a.path             # '/some/long/path/a_filename.jpg'
Path(a.path).name  # 'a_filename.jpg'

edited Oct 07 '21 at 08:14

Community

answered Mar 10 '20 at 19:44

Boris Verkhovskiy

Seems like this might not work if you were running on Windows, right? – Stephen Aug 30 '20 at 21:04
@Stephen it will work because `pathlib` [uses forward slashes](https://docs.python.org/3/library/pathlib.html#pathlib.WindowsPath) when defining paths, even on Windows. However note that `pathlib` converts `"/"` to `"\"` on Windows when you convert `Path` objects to `str` or `bytes`, so if you're modifying the above code to do something different, like getting the filename *and* the part before it (as in `path/a_filename.jpg`) but you want to keep forward slashes as forward slashes, you can do `str(PurePosixPath(urlparse(url).path))` instead of `str(Path(urlparse(url).path))`. – Boris Verkhovskiy Jan 24 '21 at 06:32
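
The separator behaviour described in the comment above can be tried on any operating system with the pure path classes. A small sketch (the shortened query string is just for illustration):

```python
from pathlib import PurePosixPath, PureWindowsPath
from urllib.parse import urlparse

url = "http://example.com/some/long/path/a_filename.jpg?some_query_params=true#and-an-anchor"
url_path = urlparse(url).path

posix = PurePosixPath(url_path)
print(posix.name)  # a_filename.jpg
print(str(posix))  # /some/long/path/a_filename.jpg

# Last directory plus file name, still with forward slashes.
print(str(PurePosixPath(posix.parent.name) / posix.name))  # path/a_filename.jpg

# PureWindowsPath mimics what Path does on Windows, regardless of the current OS.
windows = PureWindowsPath(url_path)
print(windows.name)  # a_filename.jpg
print(str(windows))  # same path, but rendered with backslashes
```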

score 2 · Answer 7 · answered Jun 10 '19 at 03:38

Sometimes there is a query string:

filename = url.split("/")[-1].split("?")[0] 
new_filename = filename.replace(".jpg", "_small.jpg")

answered Jun 10 '19 at 03:38

user2821

1

sometimes there's a `#fragment` like this: https://tools.ietf.org/html/rfc3986#section-3.5 – Boris Verkhovskiy Dec 05 '20 at 17:36

score 1 · Answer 8 · answered Feb 17 '21 at 18:37

A simple version using the os package:

import os

def get_url_file_name(url):
    url = url.split("#")[0]
    url = url.split("?")[0]
    return os.path.basename(url)

Examples:

print(get_url_file_name("example.com/myfile.tar.gz"))  # 'myfile.tar.gz'
print(get_url_file_name("example.com/"))  # ''
print(get_url_file_name("https://example.com/"))  # ''
print(get_url_file_name("https://example.com/hello.zip"))  # 'hello.zip'
print(get_url_file_name("https://example.com/args.tar.gz?c=d#e"))  # 'args.tar.gz'

score 1 · Answer 9 · answered Oct 06 '21 at 13:08

Sometimes the link redirects elsewhere (that was the case for me). In that case, you have to resolve the redirects first. Note that requests.head does not follow redirects by default, so allow_redirects=True is needed:

import requests
url = "http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg"
response = requests.head(url, allow_redirects=True)
url = response.url  # the final URL after any redirects

Then you can continue with the top answer at the moment (Ofir's):

import os
from urllib.parse import urlparse


a = urlparse(url)
print(a.path)                    # Output: /kyle/09-09-201315-47-571378756077.jpg
print(os.path.basename(a.path))  # Output: 09-09-201315-47-571378756077.jpg

This doesn't work with the URL from the question, however, as that page is no longer available.

score 0 · Answer 10 · edited May 23 '17 at 12:08

The question "Python split url to find image name and extension" shows how to extract the image name. To append to the name:

imageName =  '09-09-201315-47-571378756077'

new_name = '{0}_small.jpg'.format(imageName)

edited May 23 '17 at 12:08

Community

answered Sep 10 '13 at 19:41

Moj

score 0 · Answer 11 · answered Aug 21 '22 at 13:10

I see people using the pathlib library to parse URLs. This is not a good idea! pathlib is not designed for that; use a dedicated library such as urllib instead.

This is the most stable version I could come up with. It handles params as well as fragments:

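from urllib.parse import urlparse

def update_filename(url):
    parsed_url = urlparse(url)
    path = parsed_url.path

    # Everything after the last "/" is the file name.
    split_at = path.rfind('/') + 1
    directory, filename = path[:split_at], path[split_at:]

    if not filename:
        return None  # no file name, e.g. the path ends with "/"

    # Split on the last dot only, so "report_01.01.2022.pdf" keeps its dots.
    # Note: this raises ValueError if the file name has no extension.
    file, extension = filename.rsplit('.', 1)

    # Rebuild the path from its parts rather than using str.replace(), which
    # would also rewrite a directory that happens to contain the file name.
    new_path = f"{directory}{file}_small.{extension}"

    return parsed_url._replace(path=new_path).geturl()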

Example:

assert update_filename('https://example.com/') is None
assert update_filename('https://example.com/path/to/') is None
assert update_filename('https://example.com/path/to/report.pdf') == 'https://example.com/path/to/report_small.pdf'
assert update_filename('https://example.com/path/to/filename with spaces.pdf') == 'https://example.com/path/to/filename with spaces_small.pdf'
assert update_filename('https://example.com/path/to/report_01.01.2022.pdf') == 'https://example.com/path/to/report_01.01.2022_small.pdf'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2'
assert update_filename('https://example.com/path/to/report.pdf?param=1&param2=2#test') == 'https://example.com/path/to/report_small.pdf?param=1&param2=2#test'

score -1 · Answer 12 · answered Jul 11 '20 at 04:43

We can extract the filename from a URL by using the ntpath module.

import ntpath
url = 'http://photographs.500px.com/kyle/09-09-201315-47-571378756077.jpg'
name, ext = ntpath.splitext(ntpath.basename(url))
# name = '09-09-201315-47-571378756077', ext = '.jpg'

print(name + '_small' + ext)
# 09-09-201315-47-571378756077_small.jpg
