2

Is there a way to turn arbitrary user input names into safe filenames with an encoding that is reversible?

I have some data files that belong to entities that users named. Of course, they can do silly things like put invalid filesystem characters in their names.

The two suggestions I see frequently for this are:

A) Base64 encode them

B) Strip illegal characters

Base64 is reversible, but for debugging/introspection, it's really nice when the file names look as much like the names as possible. Just keeps everything more debuggable. Approach B isn't reversible, so the "actual" name has to be stored redundantly anyway, so there's no real value in not just using a uuid or something.

This is specifically for Linux. While this isn't Python-specific, that's what I'm implementing it in.

Travis Griggs
  • Safe for what? Database insertion, network transmission, safe from prying eyes? – Martijn Pieters Feb 13 '14 at 18:06
  • Safe to use as a legal filename. For example you can't have a filename that includes a '/' character. – Travis Griggs Feb 13 '14 at 18:07
  • Python has an `ascii()` representation of strings, or you can encode explicitly with `unicode_escape` to turn any non-ASCII codepoint into an escape code instead. Very readable still. – Martijn Pieters Feb 13 '14 at 18:07
  • So you want to accept arbitrary filenames that may include a path separator or other illegal character, and make it useable as a filename anyway? – Martijn Pieters Feb 13 '14 at 18:09

2 Answers

4

You could use URL encoding:

from urllib.parse import quote

safefilename = quote(filename, safe='')

This is fully round-trippable, and keeps ASCII characters readable:

>>> from urllib.parse import quote, unquote
>>> quote('foo/../bar', safe='')
'foo%2F..%2Fbar'
>>> unquote(quote('foo/../bar', safe=''))
'foo/../bar'

Do set safe to the empty string; the default is '/' so slashes are not normally escaped.
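One caveat: `quote()` never escapes `.`, so the names `.` and `..` pass through unchanged and would refer to directories rather than files. Below is a minimal sketch of a wrapper that handles this; the names `encode_name` and `decode_name` are hypothetical helpers, and rejecting empty names is an assumption about what the application wants.

```python
from urllib.parse import quote, unquote


def encode_name(name: str) -> str:
    """Percent-encode a user-supplied name into a single path component."""
    if not name:
        # Assumption: an empty name is treated as an error, since an empty
        # string cannot be used as a filename.
        raise ValueError("empty names cannot be used as filenames")
    encoded = quote(name, safe='')
    # quote() leaves '.' alone, so '.' and '..' need explicit escaping.
    if encoded in ('.', '..'):
        encoded = encoded.replace('.', '%2E')
    return encoded


def decode_name(filename: str) -> str:
    """Reverse encode_name()."""
    return unquote(filename)
```

With this, `encode_name('..')` gives `'%2E%2E'`, and `decode_name()` turns it back into `'..'`.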

Martijn Pieters
  • A similar but customized scheme may be more useful (e.g. keep non-ASCII characters legible). Maybe percent-encode the few illegal characters and paste the rest as-is (as UTF-8). This would require more effort, but it may be worthwhile. –  Feb 13 '14 at 18:14
  • @delnan: the problem is that it depends on the filesystem what is and isn't legal. HFS+ for example, translates `:` to `/`, so you don't want `:`s in your filename normally. Other filesystems have different restrictions. – Martijn Pieters Feb 13 '14 at 18:21
  • @delnan: so, without a list of what is safe, I'd go for a small subset instead. URL-quoting is the easy way out, though. – Martijn Pieters Feb 13 '14 at 18:22
  • Good point. On the other hand, since the meta characters of practically every file system are in the ASCII range, and multi-byte codepoints would give particularly ugly escape sequences, I'd still try smuggling code points > 127 through unaltered. –  Feb 13 '14 at 18:26
  • However, doesn't Linux have a filename limit of 255 bytes? What do you do for long URLs where such an encoding scheme might still result in an invalid filename that is too long? Is there any way to get a _valid_ filename given a URL that is reversible? – krypto07 May 28 '18 at 07:17
  • @krypto07 if you are going to hit the 255 byte limit, then you'll have to store the full name elsewhere with a unique identifier and use the identifier as the filename. A UUID would make a good unique identifier. You can then reverse this by looking up the full name again. – Martijn Pieters May 28 '18 at 10:38
  • I was hoping that some kind of compressed encoding scheme would mitigate the issue. I guess I will have to go with a separate URL store. Thanks for the prompt reply @MartijnPieters – krypto07 May 28 '18 at 15:39
  • @krypto07: compression can only go so far, especially since filenames have additional restrictions in what characters are legal. – Martijn Pieters May 28 '18 at 16:54
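The customized scheme discussed in these comments could be sketched roughly as follows. It assumes a Linux filesystem where only `/` and the NUL byte are forbidden in a filename, escapes `%` as well so that decoding stays unambiguous, and leaves every other character (including non-ASCII) as-is. The helper names and the `NAME_MAX` constant are hypothetical; 255 bytes is the limit mentioned above, and other filesystems may differ.

```python
import re

NAME_MAX = 255  # assumed per-filename byte limit, as discussed above

_ESCAPE = re.compile(r'[/\x00%]')          # forbidden characters plus the escape character
_UNESCAPE = re.compile(r'%([0-9A-F]{2})')  # the escapes produced by encode_linux_name()


def encode_linux_name(name: str) -> str:
    """Escape only what Linux forbids; keep everything else readable."""
    if not name:
        raise ValueError("empty names cannot be used as filenames")
    encoded = _ESCAPE.sub(lambda m: '%{:02X}'.format(ord(m.group())), name)
    # '.' and '..' are reserved directory entries, so escape them too.
    if encoded in ('.', '..'):
        encoded = encoded.replace('.', '%2E')
    # The limit is in bytes, and non-ASCII characters take several bytes in UTF-8.
    if len(encoded.encode('utf-8')) > NAME_MAX:
        raise ValueError("encoded name exceeds NAME_MAX; store it under an identifier instead")
    return encoded


def decode_linux_name(filename: str) -> str:
    """Reverse encode_linux_name()."""
    return _UNESCAPE.sub(lambda m: chr(int(m.group(1), 16)), filename)
```

For example, `encode_linux_name('naïve/50%')` returns `'naïve%2F50%25'`, which decodes back to the original. Names that are too long raise `ValueError`, at which point the lookup-table approach with a UUID is the fallback.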
1

You could URL-encode the string provided by the user.

According to the Wikipedia article on Percent Encoding (which itself quotes RFC 3986), the only URL-safe characters are A-Z, a-z, 0-9, dash, underscore, dot, and tilde (~). Tilde has a unique interpretation in the shell, but it's not illegal for Linux filenames.
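As a quick check of that character set, the sketch below asks Python's `urllib.parse.quote` which printable ASCII characters it leaves alone when `safe` is empty. It assumes Python 3.7 or later, where `quote` follows RFC 3986 and treats `~` as unreserved; the variable names are only illustrative.

```python
import string
from urllib.parse import quote

# Printable ASCII candidates: letters, digits, punctuation, and the space.
candidates = string.ascii_letters + string.digits + string.punctuation + ' '

# Keep the characters that quote() returns unchanged when nothing is marked safe.
untouched = ''.join(c for c in candidates if quote(c, safe='') == c)

# Everything that is not a letter or digit among the untouched characters.
extras = ''.join(c for c in untouched if not c.isalnum())
print(extras)
```

Under that assumption, the only non-alphanumeric survivors are the dash, dot, underscore, and tilde.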

It looks like URL-encoding is pretty easy in Python with urllib(2), but I'm not a Python programmer.

See: URL encoding/decoding with Python

Community
kdt