How to safely get the file extension from a URL?

Question

Consider the following URLs

http://m3u.com/tunein.m3u
http://asxsomeurl.com/listen.asx:8024
http://www.plssomeotherurl.com/station.pls?id=111
http://22.198.133.16:8024

What's the proper way to determine the file extensions (.m3u/.asx/.pls)? Obviously the last one doesn't have a file extension.

EDIT: I forgot to mention that m3u/asx/pls are playlists (text files) for audio streams and must be parsed differently. The goal is to determine the extension and then send the URL to the proper parsing function. E.g.


url = argv[1]
ext = GetExtension(url)
if ext == "pls":
  realurl = ParsePLS(url)
elif ext == "asx":
  realurl = ParseASX(url)
# (etc.)
else:
  realurl = url
Play(realurl)

GetExtension() should return the file extension (if any), preferably without connecting to the URL.

You may find this SO question http://stackoverflow.com/questions/2277030/ useful. — Marcus Whybrow, Jan 23 '11 at 22:20
What do you want to do with the file extension, and how will you handle the file not matching the file type you thought that extension should have? — Fred Nurk, Jan 23 '11 at 22:25

score 56 · Answer 1 · edited Sep 28 '22 at 15:05

Use urlparse to parse the path out of the URL, then os.path.splitext to get the extension.

import os
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url = 'http://www.plssomeotherurl.com/station.pls?id=111'
path = urlparse(url).path        # '/station.pls' (the query string is dropped)
ext = os.path.splitext(path)[1]  # '.pls'

Note that the extension may not be a reliable indicator of the type of the file. The HTTP Content-Type header may be better.
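
As a minimal Python 3 sketch of this approach applied to the four URLs in the question: the helper name get_extension and the decision to cut off a trailing ":8024"-style suffix are assumptions drawn from the question's examples, not part of the answer above. It never connects to the URL.

```python
import os
from urllib.parse import urlparse


def get_extension(url):
    """Return the lowercase extension of the URL path without the dot, or ''."""
    path = urlparse(url).path        # drops scheme, host, query and fragment
    ext = os.path.splitext(path)[1]  # e.g. '.pls', '.asx:8024' or ''
    ext = ext.partition(':')[0]      # assumption: drop a trailing ':8024'-style suffix
    return ext.lstrip('.').lower()


for url in (
    'http://m3u.com/tunein.m3u',
    'http://asxsomeurl.com/listen.asx:8024',
    'http://www.plssomeotherurl.com/station.pls?id=111',
    'http://22.198.133.16:8024',
):
    print(repr(get_extension(url)))
```

The last URL has an empty path, so the helper returns an empty string and the caller can fall through to the "play the URL directly" branch.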

answered Jan 23 '11 at 22:22 by payne, edited Sep 28 '22 at 15:05 by kloddant

Quick note: For Python 3 you should use urllib.parse: https://docs.python.org/3/library/urllib.parse.html. Otherwise you will get a `ModuleNotFoundError: No module named 'urlparse'` exception – AnhellO Jun 21 '19 at 00:31

score 46 · Answer 2 · answered Feb 17 '14 at 18:15

This is easiest with requests and mimetypes:

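import mimetypes
import requests

url = 'https://example.com/logo.png'  # placeholder URL, substitute your own
response = requests.get(url)          # downloads the whole resource
content_type = response.headers['content-type']
extension = mimetypes.guess_extension(content_type)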

The extension includes a dot prefix. For example, extension is '.png' for content type 'image/png'.
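
One caveat worth sketching: a real Content-Type value often carries parameters (for example "; charset=utf-8"), and the header may be absent altogether. The helper below is a hypothetical wrapper (the name extension_for_content_type is not from the answer) that strips parameters before the lookup and works on the header string alone, so it needs no network access.

```python
import mimetypes


def extension_for_content_type(header_value):
    """Map a raw Content-Type header value to an extension such as '.png', or None."""
    if not header_value:
        return None  # header missing or empty
    # Drop parameters such as '; charset=utf-8' before the lookup
    media_type = header_value.split(';', 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type)


print(extension_for_content_type('image/png; charset=binary'))
```

With requests, the header can be read safely with response.headers.get('content-type') and passed straight in.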

answered Feb 17 '14 at 18:15 by Seth

BTW this assumes you want to retrieve the contents of the URL. – Seth Feb 17 '14 at 18:16
mimetypes's `guess_extension` function does have its quirks though. Hand `requests` a url for a file with the '.jpg' extension and it identifies it as MIME type 'image/jpeg'. Hand that over to `mimetypes` and ask it for a reasonable extension and it spits out '.jpe'. Not wrong, just... not helpful. – brokkr Jun 22 '17 at 07:01
@brokkr yeah .jpe is valid, but that sounds like a bug to me, like `guess_extension` isn't pulling the most likely/popular from a list of valid extensions. – Seth Jun 22 '17 at 10:43
Whether bug or WAI, it simply seems to pick the first in a list: https://stackoverflow.com/a/11396288/68595 – brokkr Jun 22 '17 at 11:18
`response = requests.head(url)` is more efficient for this task – acarayol Sep 25 '17 at 21:16
@acarayol if you're not interested in the resource itself, then yes you are correct. – Seth Sep 25 '17 at 22:19

Greg Hewgill · Accepted Answer · 2019-07-22T20:38:42.583

25

The real proper way is to not use file extensions at all. Do a GET (or HEAD) request to the URL in question, and use the returned "Content-type" HTTP header to get the content type. File extensions are unreliable.

See MIME types (IANA media types) for more information and a list of useful MIME types.
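
For the playlist case in the question, the dispatch could key on the media type instead of the extension. The sketch below is only illustrative: the three media types in the table are commonly seen values for these formats, not an authoritative list, and the names PLAYLIST_TYPES and playlist_kind are made up for this example.

```python
# Assumed, commonly seen media types for the playlist formats in the question;
# servers may send other values, so treat this table as a starting point.
PLAYLIST_TYPES = {
    'audio/x-scpls': 'pls',
    'audio/x-mpegurl': 'm3u',
    'video/x-ms-asf': 'asx',
}


def playlist_kind(content_type):
    """Return 'pls', 'm3u', 'asx' or None for a raw Content-Type header value."""
    if not content_type:
        return None
    # Ignore parameters such as '; charset=utf-8' and compare case-insensitively
    media_type = content_type.split(';', 1)[0].strip().lower()
    return PLAYLIST_TYPES.get(media_type)


print(playlist_kind('audio/x-scpls; charset=utf-8'))
```

A None result would correspond to the question's final else branch, where the URL is played as-is.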

answered Jan 23 '11 at 22:21 by Greg Hewgill, edited Jul 22 '19 at 20:38

True, but what if you want a gui to pop up to save the thing? What filename do you use, and what extension do you put in your save dialog - given the URL _and_ the content-type headers? – Spacedman Jan 23 '11 at 22:23
@Spacedman: You should check if the URL path extension matches response mimetype (`mimetypes.guess_extension` might be helpful) - if not append the correct one. AFAIK that's what web browsers do. – Tomasz Elendt Jan 23 '11 at 22:32
What if "Content-type" header is missing? – Tarasovych Sep 11 '20 at 07:00
Note that using mime types is also unreliable. Sometimes a web server cannot determine the mime type, and returns "application/octet-stream" by default. See: bitkeys.work/btc_balance_sorted.csv. Mime-type in this case is "text/csv" but the header shows "application/octet-stream". This is why you also need to check the file extension (or, maybe better, the file header/signature). Also mimetypes.guess_extension is useless in this scenario. More info on checking file signatures: https://github.com/ahupp/python-magic and https://github.com/schlerp/pyfsig. – FifthAxiom Jun 25 '21 at 04:28
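
The points raised in these comments (a missing header, a generic "application/octet-stream") suggest combining both signals. The following is a rough sketch of one possible policy, not something any commenter posted: trust the Content-Type when it is specific, otherwise fall back to the extension in the URL path. The names pick_extension and GENERIC_TYPES and the example.com URLs are hypothetical.

```python
import mimetypes
import os
from urllib.parse import urlparse

# Values that carry no real information about the payload
GENERIC_TYPES = {'', 'application/octet-stream'}


def pick_extension(url, content_type=None):
    """Prefer a specific Content-Type; otherwise use the URL path extension."""
    media_type = (content_type or '').split(';', 1)[0].strip().lower()
    if media_type not in GENERIC_TYPES:
        guessed = mimetypes.guess_extension(media_type)
        if guessed:
            return guessed
    # Header missing, generic, or unknown to mimetypes: fall back to the URL
    return os.path.splitext(urlparse(url).path)[1]


print(pick_extension('http://example.com/balances.csv', 'application/octet-stream'))
print(pick_extension('http://example.com/logo', 'image/png'))
```

Checking the file signature, as the last comment suggests, would be a further fallback that this sketch does not attempt.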

score 6 · Answer 4 · answered Jan 23 '11 at 22:22

File extensions are basically meaningless in URLs. For example, if you go to http://code.google.com/p/unladen-swallow/source/browse/branches/release-2009Q1-maint/Lib/psyco/support.py?r=292 do you want the extension to be ".py" despite the fact that the page is HTML, not Python?

Use the Content-Type header to determine the "type" of a URL.

score 4 · Answer 5 · answered Jan 23 '11 at 22:35

$ python3
Python 3.1.2 (release31-maint, Sep 17 2010, 20:27:33) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from os.path import splitext
>>> from urllib.parse import urlparse 
>>> 
>>> urls = [
...     'http://m3u.com/tunein.m3u',
...     'http://asxsomeurl.com/listen.asx:8024',
...     'http://www.plssomeotherurl.com/station.pls?id=111',
...     'http://22.198.133.16:8024',
... ]
>>> 
>>> for url in urls:
...     path = urlparse(url).path
...     ext = splitext(path)[1]
...     print(ext)
... 
.m3u
.asx:8024
.pls

>>>

score 2 · Answer 6 · answered Mar 16 '11 at 09:31

To get the content type, you can write a function like the one below, which uses urllib2 (Python 2). If you need the page content anyway, you will probably be using urllib2 already, so there is no need to import os.

import urllib2

def getContentType(pageUrl):
    page = urllib2.urlopen(pageUrl)
    pageHeaders = page.headers
    contentType = pageHeaders.getheader('content-type')
    return contentType
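
urllib2 does not exist in Python 3, so here is a hedged sketch of a rough equivalent using urllib.request. The function name get_content_type, the timeout default, and the choice of a HEAD request are assumptions for illustration; some servers may not answer HEAD requests properly, in which case a plain GET would be needed.

```python
from urllib.request import Request, urlopen


def get_content_type(page_url, timeout=10):
    """Return the media type reported for page_url, without any parameters."""
    request = Request(page_url, method='HEAD')  # ask for the headers only
    with urlopen(request, timeout=timeout) as response:
        # get_content_type() lowercases the value and strips '; charset=...'
        return response.headers.get_content_type()
```

Note that get_content_type() on the headers object falls back to 'text/plain' when the server sends no Content-Type header, so that value alone does not prove the header was present.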

score 1 · Answer 7 · answered Jan 23 '11 at 22:21

Use urlparse, that'll get most of the above sorted:

http://docs.python.org/library/urlparse.html

then split the "path" up. You might be able to split the path up using os.path.split, but your example 2 with the :8024 on the end needs manual handling. Are your file extensions always three letters? Or always letters and numbers? Use a regular expression.

Supergnaw · Answer 8 · 2018-06-26T09:52:51.103

A different approach that takes nothing else into account except for the actual file extension from a url:

import re

def fileExt( url ):
    # compile regular expressions
    reQuery = re.compile( r'\?.*$', re.IGNORECASE )
    rePort = re.compile( r':[0-9]+', re.IGNORECASE )
    reExt = re.compile( r'(\.[A-Za-z0-9]+$)', re.IGNORECASE )

    # remove query string
    url = reQuery.sub( "", url )

    # remove port
    url = rePort.sub( "", url )

    # extract extension (includes the leading dot)
    matches = reExt.search( url )
    if matches is not None:
        return matches.group( 1 )
    return None

edit: added handling of explicit ports from :1234

score 0 · Answer 9 · answered May 11 '18 at 05:49

You can try the rfc6266 module, like this:

import requests
import rfc6266

req = requests.head(downloadLink)
headersContent = req.headers['Content-Disposition']
rfcFilename = rfc6266.parse_headers(headersContent, relaxed=True).filename_unsafe
filename = requests.utils.unquote(rfcFilename)
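
If installing rfc6266 is not an option, the standard library's email.message.Message can parse a Content-Disposition value too. This is a swapped-in technique, not what the answer above uses, and it is only a sketch: the helper name extension_from_disposition is hypothetical, and it takes the header string directly so it can be tried without a network request.

```python
import os
from email.message import Message


def extension_from_disposition(header_value):
    """Return the extension of the filename in a Content-Disposition value, or ''."""
    message = Message()
    message['Content-Disposition'] = header_value
    filename = message.get_filename()  # None when no filename parameter is present
    if not filename:
        return ''
    # Keep only the final component in case the server sent a path
    return os.path.splitext(os.path.basename(filename))[1]


print(extension_from_disposition('attachment; filename="station.pls"'))
```

As with rfc6266's filename_unsafe, the filename comes from the server and should be sanitized before it is used on disk.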

answered May 11 '18 at 05:49 by tom mike

For safe names, see https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names/68807910#68807910 – Cees Timmerman Sep 19 '22 at 14:15

score 0 · Answer 10 · answered May 26 '21 at 09:17

This is quite an old topic, but this one-liner is what I did:

file_ext = "." + url.split("/")[-1].split(".")[-1]

Assumption is that there is a file extension.

answered May 26 '21 at 09:17 by Jani

It would work to just take this last split: `file_ext = "." + url.split(".")[-1]` – Joseph Apr 08 '22 at 23:54
