70

I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).

(It doesn't matter for me whether the function is Cocoa, ObjC, Python, etc.)


Of course, there might be infinitely many characters that count as strange. Thus, keeping a blacklist and adding more and more to it over time is not really a solution.

I could have a whitelist. However, I don't really know how to define it. [a-zA-Z0-9 .] is a start but I also want to accept unicode chars which can be displayed in a normal way.

Karl Knechtel
Albert
  • Am I correct in understanding that you want this to be internationalizable? – N_A Sep 13 '11 at 18:10
  • @mydogisbox: No, just a single (unicode) filename from the input. – Albert Sep 13 '11 at 18:24
  • 5
    “no "strange" characters… but I also want to accept unicode chars which can be displayed in a normal way.” The problem is that there's an intersection between those sets. For example, if a user writes an article about [Феликс Дзержинский](http://en.wikipedia.org/wiki/Feliks_Dzerzhinsky), is that ‘р’ a Latin ‘p’ or a Cyrillic ‘p’? (Yes, they really are two different characters. Paste into UnicodeChecker to see.) – Peter Hosey Sep 13 '11 at 18:44
  • 2
    … As for why that's a “strange” character, a few years ago, there was a flurry of news and analysis reports about how phishing scammers had started using characters like that to make fake but real-looking domain names (“paypal.com”, for a made-up-just-now example). Browsers such as Safari now render such domains as “Punycode” (bit like half-base64 half-ASCII) for that reason. So, that character and the many others like it can be used for good **or** evil—and that's the problem. – Peter Hosey Sep 13 '11 at 18:51
  • Since this isn't a one-to-one character mapping, it sounds like you'll also need to check for duplicate filenames. – octern May 21 '12 at 00:27
  • 1
    -1. I don't think this question is well defined at all. "Sane" and "strange" mean nothing. Either accept anything that the filesystem actually accepts (in which case this question is a duplicate), or accept a clearly defined subset of ascii (in which case this question is trivial). – Clément Dec 12 '15 at 01:29
  • @Clément: Ofc it's not well defined. The question was also in the sense if there maybe is some straight-forward answer, so your comment is kind of the answer "no, there is not" - but I don't know that. Maybe Unicode defines something like invisible (strange) chars, or canonical chars or so. I don't know. Anyway, the accepted answer is kind of straight-forward and I'm happy with it now. And it's neither the two cases you describe, it's much better. – Albert Dec 12 '15 at 13:31
  • Questions should ask for code in one specific language, unless they are language-agnostic questions about algorithms, or unless they are specifically about interoperating between code written in two different languages (e.g. creating a Python extension in C or embedding a Python interpreter within a C program). Since everyone gave Python answers, I removed the other tags. I also agree that this is a duplicate, and marked it as such. – Karl Knechtel Aug 01 '22 at 20:08
  • @KarlKnechtel I specifically asked a language-agnostic question about an algorithm here. Python is just a good language to specify a generic algorithm but in my question I very specifically did not want it to be Python specific. – Albert Aug 08 '22 at 22:24
  • https://pypi.org/project/Unidecode/ – Andrew Nov 30 '22 at 16:50

13 Answers

94

Python:

"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()

this accepts Unicode characters but removes line breaks, etc.

example:

filename = u"ad\nbla'{-+\)(ç?"

gives: adblaç
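As a minimal sketch, the expression above can be wrapped in a function. The name `keep_alnum` and the `keep` parameter are hypothetical additions for illustration; `keep` just lets a few extra characters (a space by default) pass through the filter.

```python
def keep_alnum(name, keep=(" ",)):
    """Drop every character that is neither alphanumeric nor listed in `keep`."""
    # str.isalnum() is Unicode-aware in Python 3, so letters such as 'ç' survive.
    return "".join(c for c in name if c.isalnum() or c in keep).rstrip()


print(keep_alnum("ad\nbla'{-+\\)(ç?"))                # adblaç
print(keep_alnum("report 2011.txt \n", keep=" ._"))   # report 2011.txt
```

Note that this only filters characters; it says nothing about length limits, reserved names, or collisions between different inputs.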

Edit: str.isalnum() checks for alphanumeric characters in one step (see the comment from queueoverflow below). danodonovan suggested also keeping the dot:

    keepcharacters = (' ','.','_')
    "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
wallyk
Remi
  • 1
    Oh cool, yea, I didn't know `str.isalpha()` also works for such unicode chars. – Albert Sep 13 '11 at 18:25
  • Doesn't this also omit spaces? – Peter Hosey Sep 13 '11 at 18:37
  • It does actually... Is that a problem here for @Albert? Otherwise just add `or x==' '`. The overhead is small because it will be the last thing to look for. – Remi Sep 13 '11 at 18:45
  • @Peter: Yea, but from this answer, it was easy enough to make my own function exactly fitting my needs. `c.isalpha()` is close enough to what I searched for. Of course, it's still not perfect (and you gave a good example in your comment on the question about different "p"s). – Albert Sep 13 '11 at 19:42
  • 3
    This isn't safe on Windows. First off, you need to protect against legacy device filenames like CON and NUL. Second, what about case sensitivity? You might overwrite another file on accident. Third, filenames with spaces at the end aren't handled correctly by Python on Windows. That's at least three ways to break it off the top of my head. – Antimony Oct 01 '12 at 14:41
  • for your 3rd remark, I added `rstrip()`. As for CON and NUL etc., perhaps the desired file can be checked to end only with one out of a fixed list of allowed file extensions? As for case sensitivity and file-overwrite: the filename is a valid name at least, next step should be checking if the file not already exists before you overwrite (e.g. use `os.path.exists()`) – Remi Oct 03 '12 at 09:29
  • There even is `str.isalnum()` which does alphanumeric on one step. – Martin Ueding Dec 08 '12 at 21:24
  • 2
    To *not* strip out the period (full stop) `.` try ` "".join(c for c in filename if c.isalnum() or c in [' ', '.']).rstrip()` – danodonovan Apr 12 '13 at 11:42
  • Unicode characters can cause problems on some older filesystems - it's probably best to use unidecode or similar to convert characters to safe ASCII characters. Also, it [might be a good idea to remove spaces](http://stackoverflow.com/a/2306003/210945). – naught101 Apr 29 '14 at 01:28
  • A tiny subset of the potentially **dangerous** file names this would pass through: a 5 gigabyte-long file name, `.......`, `nul`, `dir.exe`, the empty string. – Bob Stein Nov 25 '15 at 11:34
16

My requirements were conservative ( the generated filenames needed to be valid on multiple operating systems, including some ancient mobile OSs ). I ended up with:

    "".join([c for c in text if re.match(r'\w', c)])

That whitelists the alphanumeric characters (a-z, A-Z, 0-9) and the underscore. Note that in Python 3, `\w` applied to a `str` also matches non-ASCII letters and digits unless the `re.ASCII` flag is passed. The regular expression can be compiled and cached for efficiency if there are a lot of strings to be matched. In my case, it wouldn't have made any significant difference.
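A sketch of the compiled variant mentioned above, assuming Python 3. The function name `ascii_word_only` is made up for this example; `re.ASCII` is what restricts `\w` to `[a-zA-Z0-9_]`.

```python
import re

# Compiled once and reused; re.ASCII limits \w (and therefore \W) to ASCII.
_NON_WORD = re.compile(r"\W+", re.ASCII)


def ascii_word_only(text):
    """Remove every character outside a-z, A-Z, 0-9 and the underscore."""
    return _NON_WORD.sub("", text)


print(ascii_word_only("my file (v2).txt"))  # myfilev2txt
print(ascii_word_only("naïve_name"))        # nave_name
```

Because the dot is removed too, any file extension would have to be appended separately.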

Ngure Nyaga
11

More or less what has been mentioned here with regexp, but in reverse (replace any NOT listed):

>>> import re
>>> filename = u"ad\nbla'{-+\)(ç1?"
>>> re.sub(r'[^\w\d-]','_',filename)
u'ad_bla__-_____1_'
Filipe Pina
  • 3
    Just use `\W` which "Matches anything other than a letter, digit or underscore. Equivalent to `[^a-zA-Z0-9_]`". – Escape0707 Jan 25 '21 at 10:08
9

I don't recommend using any of the other answers. They're bloated, use bad techniques, and replace tons of legal characters (some even removed all Unicode characters, which is nuts since they're legal in filenames). A few of them even import huge libraries just for this tiny, easy job... that's crazy.

Here's a regex one-liner which efficiently replaces every illegal filesystem character and nothing else. No libraries, no bloat, just a perfectly legal filename in one simple command.

Reference: https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words

Regex:

clean = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", dirty)

Usage:

import re

# Here's a dirty, illegal filename full of control-characters and illegal chars.
dirty = "".join(["\\[/\\?%*:|\"<>0x7F0x00-0x1F]", chr(0x1F) * 15])

# Clean it in one fell swoop.
clean = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", "-", dirty)

# Result: "-[----------0x7F0x00-0x1F]---------------"
print(clean)

This was an extreme example where almost every character is illegal, because we constructed the dirty string with the same list of characters that the regex removes, and we even padded with a bunch of "0x1F (ascii 31)" at the end just to show that it also removes illegal control-characters.

This is it. This regex is the only answer you need. It handles every illegal character on modern filesystems (Mac, Windows and Linux). Removing anything more beyond this would fall under the category of "beautifying" and has nothing to do with making legal disk filenames.


More work for Windows users:

After you've run this command, you could optionally also check the result against the list of "special device names" on Windows (a case-insensitive list of words such as "CON", "AUX", "COM0", etc).

The illegal words can be found at https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations in the "Reserved words" and "Comments" columns for the NTFS and FAT filesystems.

Filtering reserved words is only necessary if you plan to store the file on an NTFS or FAT-style disk, because Windows reserves certain "magic filenames" for internal usage. It reserves them case-insensitively and regardless of the extension, meaning that, for example, aux.c is an illegal filename on Windows (very silly).

Mac/Linux filesystems don't have silly limitations like that, so you don't have to do anything else if you're on a good filesystem. In fact, most of the "illegal characters" we filtered out in the regex are Windows-specific limitations; Mac/Linux filesystems can store most of them. We filter them anyway, since that makes the filenames portable to Windows machines.
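The optional reserved-word check could look roughly like this. It is only a sketch: the `RESERVED` set is an illustrative list that should be verified against the Wikipedia table linked above, and the function name `clean_for_windows` and the choice to prefix a filler character are assumptions, not part of the regex one-liner.

```python
import re

# Illustrative list of Windows device names (assumption: check it against the
# reference table before relying on it). Compared case-insensitively below.
RESERVED = {"CON", "PRN", "AUX", "NUL"}
RESERVED |= {f"COM{i}" for i in range(10)} | {f"LPT{i}" for i in range(10)}


def clean_for_windows(dirty, filler="-"):
    """Replace illegal characters, then defuse reserved device names."""
    clean = re.sub(r"[/\\?%*:|\"<>\x7F\x00-\x1F]", filler, dirty)
    # The extension is ignored when matching device names, so "aux.c" counts.
    stem = clean.split(".", 1)[0]
    if stem.upper() in RESERVED:
        clean = filler + clean
    return clean


print(clean_for_windows("aux.c"))        # -aux.c
print(clean_for_windows("a/b:c.txt"))    # a-b-c.txt
print(clean_for_windows("console.log"))  # console.log
```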

Mitch McMabers
  • Not sure if this is the point of the question, but this would be perfect if it also filtered out things like double spaces or ending names in periods. Just a note to anyone, more checks are needed! – JaffaKetchup May 24 '22 at 20:14
  • This is a good answer, an essential piece of the puzzle. I found another answer that is an invertible function. Together they solved my problem :) – Yan King Yin Oct 22 '22 at 07:59
7

There are a few reasonable answers here, but in my case I want to take a string which might have spaces and punctuation and, rather than just removing those, replace them with an underscore. Even though spaces are allowed in filenames on most OSes, they are problematic. Also, in my case, if the original string contained a period I didn't want it to pass through into the filename, or it would generate "extra extensions" that I might not want (I'm appending the extension myself).

def make_safe_filename(s):
    def safe_char(c):
        if c.isalnum():
            return c
        else:
            return "_"
    return "".join(safe_char(c) for c in s).rstrip("_")

print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : die!!!" ) + ".gif")

prints:

hello_you_crazy_______2579_people______die.gif

Ronan Boiteau
uglycoyote
  • 2
    I think that function might be better if repeated underscores were replaced with a single underscore. `re.sub('_{2,}', '_', 'hello_you_crazy_______2579_people______die___.gif')` gives `'hello_you_crazy_2579_people_die_.gif'` – Xevion Mar 18 '20 at 17:07
  • @Xevion But that increases the chance even more that different strings are mapped to the same filename. – BlackJack Feb 26 '21 at 16:40
7

No solutions here, only problems that you must consider:

  • what is your minimum maximum filename length? (e.g. DOS supporting only 8-11 characters; most OS don't support >256 characters)

  • what filenames are forbidden in some context? (Windows still doesn't support saving a file as CON.TXT -- see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)

  • remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.

  • is there a risk that filenames will collide -- either due to removal of characters or the same filename being used multiple times?

Consider just hashing the data and using the hexdump of that as a filename?
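A minimal sketch of the hashing idea, assuming SHA-256 is acceptable and that the data is available as bytes. The function name `hashed_filename` and the `ext` parameter are hypothetical.

```python
import hashlib


def hashed_filename(data, ext=""):
    """Name a file after the SHA-256 hex digest of its content."""
    # The digest contains only 0-9 and a-f and always has 64 characters,
    # so length limits, reserved names and odd characters are not an issue.
    return hashlib.sha256(data).hexdigest() + ext


name = hashed_filename(b"hello world", ".txt")
print(len(name))  # 68: 64 hex digits plus ".txt"
```

The price is readability: the name no longer tells a human anything about the content.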

Dragon
7

If you don't mind importing other packages, then werkzeug has a function for sanitizing filenames:

from werkzeug.utils import secure_filename

secure_filename("hello.exe")
'hello.exe'
secure_filename("/../../.ssh")
'ssh'
secure_filename("DROP TABLE")
'DROP_TABLE'

#fork bomb on Linux
secure_filename(": () {: |: &} ;:")
''

#delete all system files on Windows
secure_filename("del*.*")
'del'

https://pypi.org/project/Werkzeug/

Anders_K
  • 2
    The tool is an entire WSGI application, the relevant code can be found here: https://github.com/pallets/werkzeug/blob/a3b4572a34269efaca4d91fa4cd07dd7f6f94b6d/src/werkzeug/utils.py#L174-L218 *(Note: Never randomly trust code from the internet when it comes to security, please validate the code before using)*!! – Torxed Jan 05 '22 at 11:48
  • I entirely agree and endorse. Do as @Torxed said and get the code rather than the entire library, and never trust security advice from strangers :P – Anders_K Feb 17 '22 at 13:24
  • **That is a horrible library if you need to support unicode/UTF.** It strips all unicode characters from the filenames. Besides, who the heck needs a library for something that a regex can do in one line? See this answer: https://stackoverflow.com/a/71199182/8874388 – Mitch McMabers Feb 20 '22 at 22:01
  • Regex is hard to read, and with a library you get at least some level of community-proven testing. I would not want unicode in a filename. Chinese signs and Rocket-emoji makes poor file names. – Anders_K Feb 21 '22 at 09:01
  • @Anders_K If you prefer removing Unicode characters, that's on you and falls into the category of "beautification". Chinese letters, Scandiavian åäö, etc, are legal in filesystems. The list of ACTUALLY-illegal characters is well documented on Wikipedia (linked in my answer). Regex is easy to read, and this is a very simple regex which just removes the illegal characters and the range of ASCII 0-31 which are illegal control characters as mentioned on wikipedia. I just point this out because the regex is very, very simple, and sanitization doesn't need any library. Extra cleanup is OPTIONAL. ;) – Mitch McMabers Feb 21 '22 at 21:37
  • @MitchMcMabers I must have misunderstood OP. I thought he/she needed to create safe strings from potentially malicious users. – Anders_K Mar 02 '22 at 07:19
6

The problem with many other answers is that they only deal with character substitutions; not other issues.

Here is a comprehensive universal solution. It handles all types of issues for you, including (but not limited to) character substitution. It should cover all the bases.

Works in Windows, *nix, and almost every other file system.

import re

def txt2filename(txt, chr_set='printable'):
    """Converts txt to a valid filename.

    Args:
        txt: The str to convert.
        chr_set:
            'printable':    Any printable character except those disallowed on Windows/*nix.
            'extended':     'printable' + extended ASCII character codes 128-255
            'universal':    For almost *any* file system. '-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    """

    FILLER = '-'
    MAX_LEN = 255  # Maximum length of filename is 255 bytes in Windows and some *nix flavors.

    # Step 1: Remove excluded characters.
    BLACK_LIST = set(chr(127) + r'<>:"/\|?*')                           # 127 is unprintable, the rest are illegal in Windows.
    white_lists = {
        'universal': set('-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'),
        'printable': {chr(x) for x in range(32, 127)} - BLACK_LIST,     # 0-31, 127 are unprintable,
        'extended' : {chr(x) for x in range(32, 256)} - BLACK_LIST,
    }
    white_list = white_lists[chr_set]
    result = ''.join(x
                     if x in white_list else FILLER
                     for x in txt)

    # Step 2: Device names, '.', and '..' are invalid filenames in Windows.
    DEVICE_NAMES = 'CON,PRN,AUX,NUL,COM1,COM2,COM3,COM4,' \
                   'COM5,COM6,COM7,COM8,COM9,LPT1,LPT2,' \
                   'LPT3,LPT4,LPT5,LPT6,LPT7,LPT8,LPT9,' \
                   'CONIN$,CONOUT$,..,.'.split(',')  # Membership tests on this list are O(n); fine for a short list.
    if result.upper() in DEVICE_NAMES:  # Windows matches device names case-insensitively.
        result = f'{FILLER}{result}{FILLER}'

    # Step 3: Truncate long files while preserving the file extension.
    if len(result) > MAX_LEN:
        if '.' in txt:
            result, _, ext = result.rpartition('.')
            ext = '.' + ext
        else:
            ext = ''
        result = result[:MAX_LEN - len(ext)] + ext

    # Step 4: Windows does not allow filenames to end with '.' or ' ' or begin with ' '.
    result = re.sub(r'^ ', FILLER, result)
    result = re.sub(r'[. ]$', FILLER, result)

    return result

It replaces non-printable characters even if they are technically valid in filenames, because they are not always simple to deal with.

No external libraries needed.

ChaimG
  • I think the last two replacements contain a bug. You apparently wanted to write `result = re.sub(r"[. ]$", FILLER, result)` `result = re.sub(r"^ ", FILLER, result)`. Or alternatively, just `result = re.sub(r"(^ |[. ]$)", FILLER, result)` would also work. – Roland Pihlakas Jan 02 '22 at 13:11
  • There is also a bug on the line `'CONIN$,CONOUT$,..,.'.split()`. It should be `'CONIN$,CONOUT$,..,.'.split(',')` instead. Note the added `','` argument. Else the `split()` operation does nothing. – Roland Pihlakas Aug 13 '23 at 00:47
  • @Roland Pihlakas: Nice catch. Fixed! – ChaimG Aug 20 '23 at 15:37
3

I admit there are two schools of thought regarding DIY vs dependencies. But I come from the firm school of thought that prefers not to reinvent wheels, and to see canonical approaches to simple tasks like this. To wit, I am a fan of the pathvalidate library:

https://pypi.org/project/pathvalidate/

It includes a function, sanitize_filename(), which does what you're after.

I would prefer this to any of the numerous home-baked solutions. Ideally, I'd like to see a sanitizer in os.path that is sensitive to filesystem differences and does no unnecessary sanitising. I imagine pathvalidate takes the conservative approach and produces valid filenames that can span at least NTFS and ext4 comfortably, but it's hard to imagine it even bothers with old DOS constraints.

Bernd Wechner
1

Extra note for all other answers

Add a hash of the original string to the end of the filename. This prevents conflicts when your conversion produces the same filename from different strings.
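One way this could look, as a sketch only. The names `with_hash_suffix` and `simple_sanitize`, the choice of SHA-1, and the 8-digit suffix length are all assumptions for illustration; `simple_sanitize` is just a stand-in for whichever sanitizer from the other answers you use.

```python
import hashlib


def with_hash_suffix(original, sanitized, digits=8):
    """Append a short hash of the original string to the sanitized name."""
    digest = hashlib.sha1(original.encode("utf-8")).hexdigest()[:digits]
    return f"{sanitized}_{digest}"


def simple_sanitize(s):
    # Stand-in sanitizer for the demo: every non-alphanumeric becomes "_".
    return "".join(c if c.isalnum() else "_" for c in s)


a = with_hash_suffix("a/b", simple_sanitize("a/b"))
b = with_hash_suffix("a?b", simple_sanitize("a?b"))
# Both inputs sanitize to "a_b"; the hash suffix keeps the results apart.
print(a, b)
```

A truncated hash makes collisions very unlikely rather than impossible, so checking whether the file already exists is still worthwhile. If a file extension is needed, append it after the suffix.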

Alexander C
0

Here is what I came up with, inspired by uglycoyote:

import time

def make_safe_filename(s):
    def safe_char(c):
        if c.isalnum() or c=='.':
            return c
        else:
            return "_"

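    safe = ""
    last_replaced = False  # True if the previous character was replaced by "_"
    for c in s:
        # Cut off overly long names and append a millisecond timestamp.
        if len(safe) > 200:
            return safe + "_" + str(time.time_ns() // 1000000)

        safe_c = safe_char(c)
        curr_replaced = c != safe_c
        # Collapse a run of replaced characters into a single "_".
        if not last_replaced or not curr_replaced:
            safe += safe_c
        last_replaced = curr_replaced
    return safe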

And to test:

print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!hello you crazy $#^#& 2579 people!!! : hi!!!" ) + ".gif")
Martin Kunc
0

Another approach is to specify a replacement for each unwanted symbol. This way the filename may be more readable.

>>> substitute_chars = {'/':'-', ' ':''}
>>> filename = 'Cedric_Kelly_12/10/2020 7:56 am_317168.pdf'
>>> "".join(substitute_chars.get(c, c) for c in filename)
'Cedric_Kelly_12-10-20207:56am_317168.pdf'
Dmitry
0

Python:

for c in r'[]/\;,><&*:%=+@!#^()|?^':
    filename = filename.replace(c,'')

(This is just an example set of characters you may want to remove.) The r prefix makes Python treat the string as a raw string, so the backslash \ can be included in the list and removed as well.

Edit: regex solution in Python:

import re
re.sub(r'[\[\]/\\;,><&*:%=+@!#^()|?]', '', filename)
Remi
  • 4
    There might be infinite many characters which might be strange. It is not really a solution to add more and more to that list over the time. – Albert Sep 13 '11 at 17:50
  • I see; are the ALLOWED characters known? – Remi Sep 13 '11 at 17:55
  • I don't really know how to define the allowed chars. Basically I mean all chars which can be displayed and don't have some strange behavior (in that they have negative width or add a newline or so). That is what I mean with 'sane'. That is basically the whole question, because otherwise, it would be trivial. – Albert Sep 13 '11 at 17:59
  • I think you rather want "][[]" to capture both "[" and "]". I'm not sure though – ealfonso Sep 23 '13 at 19:35
  • 1
    @Albert: Unicode is not infinite, and as a user if I'm going to input a file name I don't really want strange program logic to decide what I may or may not put in there. Removing just enough to ensure safety (such as directory separators and relative path markers like `.` and `..`) is fine, but removing more? I'm not sure. – Clément Dec 12 '15 at 01:27
  • 2
    Quite sure this regex is wrong. [ is a special char in regex. – Carson Ip Jan 03 '20 at 09:09
  • -1. The regex solution is clearly untested. As @CarsonIp points out, it uses regex-reserved characters, not only `[`, but also `]*+^?|`. Because of this, the regex fails to compile. Also, this approach just doesn't work well generally, because as the OP points out, a character blacklist simply doesn't scale well at all, so a whitelist is probably preferable. – Graham Mar 31 '20 at 20:36