Open URL encoded filenames in Unix

Question

I'm a Python n00b. I have downloaded a URL-encoded file and I want to work with it on my Unix system (Ubuntu 14).

When I try and run some operations on my file, the system says that the file doesn't exist. How do I change my filename to a unix recognizable format?

Some of the files I have downloaded have spaces in their names, so they would have to be written with a backslash followed by a space. Below is a snippet of my code:

link = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"

output = open(link.split('/')[-1],'wb')
output.write(site.read())
output.close()

shutil.copy(link.split('/')[-1], tmp_dir)

Dr. Jan-Philip Gehrcke · Answer 1 · 2015-02-07T19:07:50.640

The "link" you have is actually a URL. URLs are not allowed to contain certain characters, such as spaces. These characters can still be represented, but only in an encoded form. The translation from such characters to their encoded form follows a rule set commonly known as "URL encoding" or percent-encoding. If you are interested, have a read here: http://en.wikipedia.org/wiki/Percent-encoding
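As a small illustration of the encoding rule, here is a minimal sketch that assumes Python 3 and uses the standard library's urllib.parse.quote and urllib.parse.unquote. The variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# A file name containing a literal space character.
name = "Scheherezade Theme.mp3"

# Percent-encode it: the space becomes the three characters "%20".
encoded = quote(name)

# Decoding inverts the operation and gives back the original name.
decoded = unquote(encoded)

print(encoded)
print(decoded == name)
```

The encoded form is what appears in the URL; the decoded form is what you would normally want as a file name on disk.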

The encoding operation can be inverted, which is called decoding. The tool set with which you downloaded the files you mentioned most probably did the decoding for you already. In your example link there is only one encoded sequence, "%20", and it encodes a space. Your download tool set probably decoded this and saved the file to your file system with an actual space character in the file name. That is, you most likely have a file in the file system with the following basename:

Scheherezade Theme.mp3

So, when you want to open that file from within Python, and all you have is the link, you first need to get the decoded variant of it. Python can decode URL-encoded strings with built-in tools. This is what you need:

>>> import urllib.parse
>>> url = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
>>> urllib.parse.unquote(url)
'http://www.stephaniequinn.com/Music/Scheherezade Theme.mp3'
>>>

This assumes that you are using Python 3, and that your link object is a unicode object (type str in Python 3).

Starting off with the decoded URL, you can derive the filename. Your link.split('/')[-1] method might work in many cases, but J.F. Sebastian's answer provides a more reliable method.

score 1 · Answer 2 · edited May 23 '17 at 11:57

To extract a filename from a URL:

#!/usr/bin/env python2
import os
import posixpath
import urllib
import urlparse

def url2filename(url):
    """Return basename corresponding to url.

    >>> url2filename('http://example.com/path/to/file?opt=1')
    'file'
    """
    urlpath = urlparse.urlsplit(url).path  # pylint: disable=E1103
    basename = posixpath.basename(urllib.unquote(urlpath))
    if os.path.basename(basename) != basename:
        raise ValueError  # refuse 'dir%5Cbasename.ext' on Windows
    return basename

Example:

>>> url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3")
'Scheherezade Theme.mp3'
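The function above targets Python 2. On Python 3 the same approach could be sketched as follows, assuming only that urllib.unquote and urlparse.urlsplit are replaced by their urllib.parse counterparts. The error message text is my own addition.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the basename corresponding to url (Python 3 sketch)."""
    # Keep only the path component, dropping the query string and fragment.
    urlpath = urlsplit(url).path
    # Decode percent-escapes, then take the last path segment.
    basename = posixpath.basename(unquote(urlpath))
    # Refuse names that the local OS would treat as containing a directory,
    # e.g. 'dir%5Cbasename.ext' on Windows.
    if os.path.basename(basename) != basename:
        raise ValueError("unsafe file name in URL: %r" % url)
    return basename


print(url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"))
```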

You do not need to escape the space in the filename if you use it inside a Python script.
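To illustrate that point, here is a minimal sketch, assuming Python 3. It writes a file whose name contains a space and copies it with shutil.copy, with no backslash anywhere. The work_dir directory and the placeholder bytes are hypothetical stand-ins for the real download; tmp_dir mirrors the name used in the question.

```python
import os
import shutil
import tempfile

# The decoded file name, containing a real space (no backslash needed).
filename = "Scheherezade Theme.mp3"

# Hypothetical scratch directory standing in for the download location.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, filename)

# Placeholder content instead of the real downloaded data.
with open(src, "wb") as output:
    output.write(b"placeholder bytes")

# Copy the file into another directory, passing the name as a plain string.
tmp_dir = tempfile.mkdtemp()
copied = shutil.copy(src, tmp_dir)

print(os.path.exists(os.path.join(tmp_dir, filename)))
```

Backslash escaping is only needed when a shell parses the name; Python functions receive the string as it is.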

See complete code example on how to download a file using Python (with a progress report).
