Identify the file extension of a URL

Question

I want to extract the file extension, if there is one, from web addresses. The goal is to identify links that point to file types I do not want (e.g. .jpg, .exe).

For example, from the URL www.example.com/image.jpg I want to extract the extension jpg, and for a URL with no extension, such as www.example.com/file, I want to get nothing back.

I'm not sure how to implement this. One idea was to take everything after the last dot: if there is an extension, I can look it up in my list, and if there isn't, www.example.com/file would return com/file, which is fine because it is not in my list of excluded extensions.

There may be a better way, using a package I am not aware of, that can tell what is and isn't an actual extension (i.e. cope with URLs that do not have one).

score 8 · Accepted Answer · answered Feb 03 '15 at 22:31

The urlparse module (urllib.parse in Python 3) provides tools for working with URLs. Although it doesn't provide a way to extract the file extension from a URL, it's possible to do so by combining it with os.path.splitext:

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
from os.path import splitext

def get_ext(url):
    """Return the filename extension from url, or ''."""
    parsed = urlparse(url)
    root, ext = splitext(parsed.path)
    return ext  # or ext[1:] if you don't want the leading '.'

Example usage:

>>> get_ext("www.example.com/image.jpg")
'.jpg'
>>> get_ext("https://www.example.com/page.html?foo=1&bar=2#fragment")
'.html'
>>> get_ext("https://www.example.com/resource")
''
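To tie this back to the question's goal of skipping unwanted file types, here is a minimal Python 3 sketch built on the same urlparse + splitext idea. The names EXCLUDED_EXTENSIONS, url_extension and is_excluded are hypothetical, and prepending "http://" to scheme-less input is an assumption made so that a bare host such as www.example.com is not read as a path ending in ".com":

```python
from os.path import splitext
from urllib.parse import urlparse

# Hypothetical blocklist; adjust to the extensions you want to skip.
EXCLUDED_EXTENSIONS = {".jpg", ".exe"}


def url_extension(url):
    """Return the lower-cased extension of the URL's path, or ''."""
    # Assumption: input without a scheme (e.g. "www.example.com/a.jpg") is
    # treated as http, so the host name is not mistaken for part of the path.
    if "://" not in url:
        url = "http://" + url
    path = urlparse(url).path
    return splitext(path)[1].lower()


def is_excluded(url):
    """Return True if the URL's extension is in the blocklist."""
    return url_extension(url) in EXCLUDED_EXTENSIONS
```

Lower-casing the result means image.JPG and image.jpg are treated the same; drop the .lower() call if case matters for your list.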

Zero Piraeus

So what will you get if the URL is something like "https://www.example.com/LWUERKLFsdLKFJGJNasgdfSDsdfaL"? – Egor Richman Apr 07 '23 at 05:38
@EgorZamotaev that case is covered in the examples given. – Zero Piraeus Apr 07 '23 at 10:31
No, it isn't. A simple way to get a file extension from the response using requests is shown here: https://stackoverflow.com/a/70532887/12236467. – Egor Richman Apr 08 '23 at 03:05

score 0 · Answer 2 · answered Apr 08 '23 at 03:13

If the URL has no extension, you can use the response's 'Content-Type' header to derive one, like so:

from urllib.request import urlopen

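def get_ext(url):
    """Return the subtype of the response's Content-Type, e.g. 'jpeg'."""
    resp = urlopen(url)
    # Drop parameters such as "; charset=utf-8" before taking the subtype.
    content_type = resp.info()['Content-Type'].split(";")[0]
    ext = content_type.split("/")[-1].strip()
    return ext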
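Note that the subtype is not always a usable extension (text/plain gives 'plain', for instance). One possible refinement, sketched here under the assumption that you already have the header value as a string, is to let the standard library's mimetypes module do the mapping. The helper name ext_from_content_type is hypothetical, and the exact extension returned for some types can vary with the Python version and the system's MIME tables:

```python
import mimetypes


def ext_from_content_type(content_type):
    """Map a Content-Type header value to an extension, or '' if unknown."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime_type = content_type.split(";")[0].strip().lower()
    # guess_extension returns None for types it does not know.
    return mimetypes.guess_extension(mime_type) or ""
```

This does not touch the network itself; pass it the header value from whatever response object you are using.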

Egor Richman
