
In WeasyPrint’s public API I accept filenames (among other types) for the HTML inputs. Any filename that works with the built-in open() should work, but I need to convert it to a URL in the file:// scheme that will later be passed to urllib.urlopen().

(Everything is in URL form internally. I need to have a "base URL" for documents in order to resolve relative URL references with urlparse.urljoin().)
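To illustrate why a base URL is needed, here is a minimal sketch of resolving relative references against a file:// base. It assumes Python 3, where `urlparse.urljoin()` lives in `urllib.parse`; the paths are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical base URL of an HTML document on disk.
base_url = 'file:///home/me/doc/index.html'

# Relative references are resolved against the directory of the base document.
stylesheet = urljoin(base_url, 'style/main.css')
logo = urljoin(base_url, '../img/logo.png')

# A reference that is already absolute is returned unchanged.
remote = urljoin(base_url, 'http://example.com/a.png')

print(stylesheet)  # file:///home/me/doc/style/main.css
print(logo)        # file:///home/me/img/logo.png
print(remote)      # http://example.com/a.png
```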

urllib.pathname2url is a start:

> Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. *This does not produce a complete URL.* The return value will already be quoted using the quote() function.

The emphasis is mine, but I do need a complete URL. So far this seems to work:

import os
import urllib

def path2url(path):
    """Return file:// URL from a filename."""
    path = os.path.abspath(path)
    # Python 2: encode unicode filenames to bytes before quoting.
    if isinstance(path, unicode):
        path = path.encode('utf8')
    return 'file:' + urllib.pathname2url(path)

UTF-8 seems to be recommended by RFC 3987 (IRI). But in this case (the URL is meant for urllib, eventually) maybe I should use sys.getfilesystemencoding()?
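The choice of encoding matters because the same filename percent-encodes differently under different encodings, and whoever decodes the URL has to use the same one. A small sketch, assuming Python 3 and a hypothetical filename; Latin-1 stands in here for "some non-UTF-8 filesystem encoding":

```python
import sys
from urllib.parse import quote, unquote

path = '/tmp/caf\u00e9.html'  # hypothetical non-ASCII filename

# Percent-encode the same name under two different byte encodings.
utf8_quoted = quote(path.encode('utf-8'))
latin1_quoted = quote(path.encode('latin-1'))

print(utf8_quoted)    # /tmp/caf%C3%A9.html
print(latin1_quoted)  # /tmp/caf%E9.html

# Decoding only recovers the name when the encodings match.
utf8_ok = unquote(utf8_quoted, encoding='utf-8') == path
latin1_ok = unquote(latin1_quoted, encoding='latin-1') == path
mismatch_ok = unquote(latin1_quoted, encoding='utf-8') == path

# What this interpreter reports as its filesystem encoding (platform-dependent).
print(sys.getfilesystemencoding())
```

If the URL never leaves the process that built it, either choice can work as long as the decoding side agrees with it.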

However, based on the literature I should prepend not just file: but file:// ... except when I should not: On Windows the results from nturl2path.pathname2url() already start with three slashes.

So the question is: is there a better way to do this and make it cross-platform?

Simon Sapin

4 Answers


For completeness, in Python 3.4+, you should do:

import pathlib

pathlib.Path(absolute_path_string).as_uri()
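As a rough sketch of how this could replace `path2url()` from the question: `as_uri()` raises `ValueError` for relative paths, so the helper below makes the path absolute first. The pure-path lines only illustrate what the POSIX and Windows flavours produce; the example paths are hypothetical.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

def path2url(path):
    """Return a file:// URL for a filename, absolute or relative."""
    # as_uri() rejects relative paths, so anchor the path at the
    # current working directory first.
    return Path(path).absolute().as_uri()

# Pure paths never touch the filesystem, so both flavours can be
# tried on any platform.
posix_url = PurePosixPath('/tmp/foo bar.html').as_uri()
windows_url = PureWindowsPath(r'C:\foo\bar.html').as_uri()

print(posix_url)    # file:///tmp/foo%20bar.html
print(windows_url)  # file:///C:/foo/bar.html
print(path2url('index.html'))
```

Both flavours produce three slashes after `file:`, which covers the Windows special case raised in the question.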
ToBeReplaced

I'm not sure the docs are rigorous enough to guarantee it, but I think this works in practice:

import urlparse, urllib

def path2url(path):
    return urlparse.urljoin(
      'file:', urllib.pathname2url(path))
Dave Abrahams
  • Tested on Linux, Windows, and OS X and it works fine on all three. – javawizard Jun 11 '13 at 19:06
  • And in py3k this becomes `import urllib.parse as urlparse` and `import urllib.request as urllib` – danodonovan Feb 07 '14 at 11:28
  • You should probably call `os.path.abspath(path)` here. – jfs Dec 19 '14 at 22:14
  • If you use the [six](https://pythonhosted.org/six/) library to assure Python 2 and 3 portability: `return six.moves.urllib_parse.urljoin( "file://", six.moves.urllib.request.pathname2url(path))` – George V. Reilly Jan 04 '15 at 23:30
  • This produces URLs that look like `file:///C:/foo%20bar/spam/eggs`. Shouldn't it be `file:///C%3A/foo%20bar/spam/eggs`, with the colon turned into `%3A`? – stib Mar 20 '15 at 14:40
  • @stib: apparently not, not on Windows, where the drive letter colon is to be kept intact. See https://blogs.msdn.microsoft.com/ie/2006/12/06/file-uris-in-windows/ and https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2 – Martijn Pieters Nov 17 '17 at 18:13

Credit to the comment from @danodonovan above.

For Python 3, the following code will work:

from urllib.parse import urljoin
from urllib.request import pathname2url

def path2url(path):
    return urljoin('file:', pathname2url(path))
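A quick way to sanity-check this approach is a round trip back to the original path. The sketch below adds the `os.path.abspath()` call suggested in the comments; `url2path()` is a hypothetical helper name, and the filename is made up.

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import pathname2url, url2pathname

def path2url(path):
    """Return a file:// URL for a filename, absolute or relative."""
    return urljoin('file:', pathname2url(os.path.abspath(path)))

def url2path(url):
    """Hypothetical inverse: recover the local path from a file:// URL."""
    return url2pathname(urlparse(url).path)

original = os.path.abspath('foo bar.html')
url = path2url('foo bar.html')
round_tripped = url2path(url)

print(url)                        # file:///<current directory>/foo%20bar.html
print(round_tripped == original)  # True
```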
Antoine
kevinarpe

Does the following work for you?

from urlparse import urlparse, urlunparse

urlunparse(urlparse('yourURL')._replace(scheme='file'))
Simon Sapin
Jon Clements
  • The idea is interesting, but I don’t know if this is enough. In particular, `\` in Windows filenames is supposed to become `/`. Still on Windows, the C in `C:\foo\bar.html` is parsed as a scheme and then replaced. The expected output would be `file:///C:/foo/bar.html`. – Simon Sapin Jul 27 '12 at 12:56
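The objection in this comment can be reproduced with a short sketch, assuming Python 3 (where these functions live in `urllib.parse`). The exact rewritten string may differ between Python versions, so only the parsed scheme is shown as output.

```python
from urllib.parse import urlparse, urlunparse

windows_path = r'C:\foo\bar.html'

# The drive letter is taken for a URL scheme (and lower-cased).
parsed = urlparse(windows_path)
print(parsed.scheme)  # c

# Replacing the scheme therefore throws the drive letter away,
# and the backslashes are left untouched.
rewritten = urlunparse(parsed._replace(scheme='file'))
print(rewritten)
```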