157

I'm looking for a PHP function that will sanitize a string and make it ready to use as a filename. Does anyone know of a handy one?

( I could write one, but I'm worried that I'll overlook a character! )

Edit: for saving files on a Windows NTFS filesystem.

Gordon
user151841
  • 1
    Can you be more specific: What is to happen with Umlauts (remove or convert to base character?) What is to happen with special characters? – Pekka Jan 07 '10 at 16:00
  • 1
    For which Filesystem? They differ. See http://en.wikipedia.org/wiki/Filename#Comparison_of_file_name_limitations – Gordon Jan 07 '10 at 16:06
  • Windows :) Need 15 characters. – user151841 Jan 07 '10 at 16:12
  • 1
    I'd like to point out that the "blacklist" solutions suggested in some of the answers are not sufficient, as it is infeasible to check for every possible undesirable character (in addition to special characters, there are characters with accents and umlauts, entire non-english/latin alphabets, control characters, etc. to deal with). So I'd argue that a "whitelist" approach is always better, and normalizing the string (as suggested by Blair McMillan's comment on Dominic Rodger's answer) will allow for natural handling of any letters with accents, umlauts, etc. – Sean the Bean Jun 27 '16 at 21:43
  • A good way maybe using regular expressions, see this python script I made: https://github.com/gsscoder/normalize-fn – gsscoder Nov 08 '19 at 15:36

19 Answers

185

Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:

// Remove anything which isn't a word, whitespace, number
// or any of the following characters -_~,;[]().
// If you don't need to handle multi-byte characters
// you can use preg_replace rather than mb_ereg_replace
// Thanks @Łukasz Rysiak!
$file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
// Remove any runs of periods (thanks falstro!)
$file = mb_ereg_replace("([\.]{2,})", '', $file);
Sean Vieira
  • 1
    this regex returns warning " Unknown modifier '|' ", check at codepad.org/jf6O0OOY – AgelessEssence May 10 '13 at 06:20
  • 2
    @iim.hlk - yep, it was missing the wrapping parenthesis. I've added those now. Thanks! – Sean Vieira May 10 '13 at 11:56
  • This doesn't handle file names like "image.jpeg", it produces "imagejpeg" – JamesHalsall Mar 22 '14 at 21:10
  • @JamesHalsall - correct. I've updated it so it does :-) Thanks for making the answer better! – Sean Vieira Mar 23 '14 at 14:48
  • double check for ']' in file name. may be '\\(\\]' must be '\\(\\)' ? – 23W Oct 27 '14 at 14:11
  • @23W - *wow* that survived for a long time - thanks for helping make the answer better! – Sean Vieira Oct 27 '14 at 14:19
  • 1
    I'm not sure you want to let the colon (:) through on Windows as you can change drives that way (ie "d:\junk.txt" will get converted to d:junk.txt) – Paul Hutchinson Oct 30 '14 at 14:01
  • 2
    there's a flaw in there, you should split it into two and run the check for `..` afterwards. For example `.?.` would end up being `..`. Although since you filter `/` I can't see how you'd exploit that further right now, but it shows why the check for `..` is ineffective here. Better yet probably, don't replace, just reject if it doesn't qualify. – falstro Nov 26 '14 at 08:40
  • Not quite sure why but it doesn't seem to replace colons. Here's an example online: [clicky](http://sandbox.onlinephpfunctions.com/code/2aae43286bbd45a9140c9ab2e90eccda4a674570). I might as well have an error in there, little sleepy :P – Tarulia Dec 08 '14 at 19:53
  • You might also want to check that the file doesn't _begin_ with a `.`. Wouldn't want to overwrite / create hidden files, or things like .htaccess, .htpasswd, etc. – Alex Reinking Feb 13 '15 at 16:59
  • 1
    since i've used your solution, i have to mention, that if you use this solution with utf-8, you should switch to mb_ereg_replace. Otherwise chars will be messed up. – Łukasz Rysiak Aug 27 '15 at 20:05
  • This answer is terrible. Why would you allow the characters `-_~,;:[]()` in a filename?! – Mr Pablo Nov 27 '15 at 11:49
  • 2
    Because none of those values are [illegal on the Windows file system](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx) and why lose more information than you have to? You could change the regular expression to simply `[^a-z0-9_-]` if you want to be really restrictive - or just use a generated name and throw away the given name and avoid *all* these problems. :-) – Sean Vieira Nov 27 '15 at 13:15
  • 3
    Note that : is illegal. – JasonXA Jan 28 '16 at 04:29
  • Updated - thanks for helping make the answer better! – Sean Vieira Jan 28 '16 at 04:45
  • 1
    I would add `trim()` to trim spaces before and after, so that copy-pasted ` filename.txt ` would sanitize to `filename.txt` – Slava Mar 20 '16 at 20:32
  • Also, leaving whitespace characters like Tab, New line and Carriage return makes no sense in a file name. I suggest replacing `\s` with a literal space (hit spacebar). As a result: `trim(mb_ereg_replace("([^\w \d\-_~,;\[\]\(\).])", '', $file))`. – Slava Mar 21 '16 at 09:19
  • @falstro `file..name.txt` is a perfectly valid file name. Why would one reject it? – Slava Mar 23 '16 at 10:27
  • @Alph.Dev because the discussion was about the file called `..` (which is typically a hard link to a parent directory), not arbitrary usage within a file name. – falstro Mar 23 '16 at 10:59
  • This will fail to 'make it ready to use for a filename' if the results is too long. – ChrisJJ Aug 30 '16 at 10:42
  • 1
    @Alph.Dev Its not "sense" related, its simply forbidden to use those whitespace characters in Windows: http://stackoverflow.com/a/42058764/318765 @falstro Your suggestion does not make sense as `/` is removed and `..filename` does not target the parent directory. The only filename that could be a problem is `..` or `.hiddenFilen`, but you can handle it with `ltrim()` as mentioned in my answer as well. – mgutt Feb 06 '17 at 08:14
  • @mgutt What is your point? Forbidden or useless, it makes no difference. I suggest to remove/replace them so that we can have a valid filename afterwards. We are sanitizing file names aren't we here? – Slava Feb 06 '17 at 13:18
  • @Alph.Dev It is a difference for this answer. As it is forbidden the answer of SeanVieira is completely wrong because its unsafe to use. That was the point I liked to highlight as it is the most popular answer. – mgutt Feb 06 '17 at 16:47
  • I think using `mb_ereg_replace` for keeping any language character is **the most wise way**, but like this: `mb_regex_encoding("UTF-8");` then `$fixedfilename=mb_ereg_replace('^[\s]+|[^\P{C}]|[\\\\\/\*\:\?\"\>\<\|]+|[\s\.]+$','',$filename);` because we have to remove somethings else like removing useless dots and spaces from end. Also it is better avoid to accept characters like ` and ' and ; and % and & that can have meanings for URL or PHP or HTML. A possible one line fast fixer can be this: [PHP Sandbox](http://sandbox.onlinephpfunctions.com/code/76c2b08763b47d822d6c9c204dc9e976c2582fb3) – MMMahdy-PAPION Dec 31 '20 at 17:52
  • These patterns are poorly constructed and demonstrate a general lack of understanding of basic regex entities. – mickmackusa Apr 06 '23 at 00:24
86

This is how you can sanitize filenames for a file system, as asked:

function filter_filename($name) {
    // remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
    $name = str_replace(array_merge(
        array_map('chr', range(0, 31)),
        array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
    ), '', $name);
    // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
    $ext = pathinfo($name, PATHINFO_EXTENSION);
    $name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
    return $name;
}

Everything else is allowed in a filesystem, so the question is perfectly answered...

... but it could be dangerous to allow for example single quotes ' in a filename if you use it later in an unsafe HTML context because this absolutely legal filename:

 ' onerror= 'alert(document.cookie).jpg

becomes an XSS hole:

<img src='<? echo $image ?>' />
// output:
<img src=' ' onerror= 'alert(document.cookie)' />

Because of that, the popular CMS software WordPress removes them, but it covered all relevant characters only after several updates:

$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
// ... a few lines later, whitespace is removed as well ...
preg_replace( '/[\r\n\t -]+/', '-', $filename )

Their list now includes most of the characters that are part of the URI reserved characters and URL unsafe characters lists.

Of course you could simply encode all these characters on HTML output, but most developers, me included, follow the idiom "Better safe than sorry" and delete them in advance.

So, finally, I would suggest using this:

function filter_filename($filename, $beautify=true) {
    // sanitize filename
    $filename = preg_replace(
        '~
        [<>:"/\\\|?*]|            # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
        [\x00-\x1F]|             # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
        [\x7F\xA0\xAD]|          # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
        [#\[\]@!$&\'()+,;=]|     # URI reserved https://www.rfc-editor.org/rfc/rfc3986#section-2.2
        [{}^\~`]                 # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
        ~x',
        '-', $filename);
    // avoids ".", ".." or ".hiddenFiles"
    $filename = ltrim($filename, '.-');
    // optional beautification
    if ($beautify) $filename = beautify_filename($filename);
    // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
    $ext = pathinfo($filename, PATHINFO_EXTENSION);
    $filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
    return $filename;
}

Everything else that does not cause problems with the file system should be part of an additional function:

function beautify_filename($filename) {
    // reduce consecutive characters
    $filename = preg_replace(array(
        // "file   name.zip" becomes "file-name.zip"
        '/ +/',
        // "file___name.zip" becomes "file-name.zip"
        '/_+/',
        // "file---name.zip" becomes "file-name.zip"
        '/-+/'
    ), '-', $filename);
    $filename = preg_replace(array(
        // "file--.--.-.--name.zip" becomes "file.name.zip"
        '/-*\.-*/',
        // "file...name..zip" becomes "file.name.zip"
        '/\.{2,}/'
    ), '.', $filename);
    // lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
    $filename = mb_strtolower($filename, mb_detect_encoding($filename));
    // ".file-name.-" becomes "file-name"
    $filename = trim($filename, '.-');
    return $filename;
}

At this point you need to generate a filename if the result is empty, and you can decide whether you want to encode UTF-8 characters. You do not need to, though, as UTF-8 is allowed in all file systems that are used in web hosting contexts.
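Generating a name when the filtered result is empty could look like the sketch below. This is only an illustration: `filename_or_fallback()`, its parameters and the `file-` prefix are hypothetical names, not part of the functions above, and the input is assumed to be a string that has already been filtered.

```php
<?php
// Hypothetical helper (not part of the answer's code): returns the filtered
// name unchanged, or a random replacement when filtering left nothing over.
function filename_or_fallback(string $filtered, string $fallbackExt = ''): string
{
    if ($filtered !== '') {
        return $filtered;
    }
    // bin2hex(random_bytes(8)) yields 16 hexadecimal characters.
    $name = 'file-' . bin2hex(random_bytes(8));
    // Append the extension only when one was supplied.
    return $fallbackExt !== '' ? $name . '.' . $fallbackExt : $name;
}
```

A random name is used here rather than a timestamp so that two uploads in the same second are unlikely to collide; whether that matters depends on your application.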

The only thing you have to do is to use urlencode() (as you hopefully do it with all your URLs) so the filename საბეჭდი_მანქანა.jpg becomes this URL as your <img src> or <a href>: http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg

Stack Overflow does that, so I can post this link as a user would do it:
http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg

So this is a completely legal filename and not a problem as @SequenceDigitale.com mentioned in his answer.

mgutt
  • Oh... The function works well, but since some time it started putting - between every character, like `r-u-l-e-s` and I have no idea why this happen. Sure is that it is not fault of the function, but just asking - what might be reason of such behavior? Wrong encoding? –  Mar 18 '17 at 14:30
  • 1
    Oh well... Just made a debug and it happens just after the `preg_replace` in `filter_filename()`. –  Mar 18 '17 at 14:46
  • After removing these comments, it started working again. –  Mar 18 '17 at 15:01
  • Which comments did you remove? Send me an email if this is easier: http://gutt.it/contact.htm – mgutt Mar 18 '17 at 17:07
  • those from first `preg_replace`. –  Mar 18 '17 at 18:46
  • Note that mb_strtolower can create `?` and \. – mikeytown2 Mar 29 '17 at 23:18
  • @mikextown2 Are you sure? Should not happen because of `mb_detect_encoding` – mgutt Mar 30 '17 at 00:13
  • I added "u" modifier to the end of the regexp for work with Unicode filenames. – vatavale May 26 '19 at 16:15
  • 4
    Beware: The double backslash in the RegEx must be additionally escaped with a third one for the PHP string. `preg_replace('~[<>:"/\\|?*]~x','-', $filename)` will otherwise let `Hello\World.txt` pass! Change `[<>:"/\\|?*]` to `[<>:"/\\\|?*]` to fix that. – spackmat Feb 06 '20 at 10:15
  • The sanitization on file name length is not working on files with RTL names. With this you will get an completely empty file. – Deckard Oct 26 '22 at 09:52
54

SOLUTION 1 - simple and effective

$file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );

  • strtolower() guarantees the filename is lowercase, which avoids names that differ only in case (Windows does not distinguish such names on NTFS by default)
  • [^a-z0-9]+ ensures that the filename keeps only letters and numbers
  • Substituting invalid characters with '-' keeps the filename readable

Example:

URL:  http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename

SOLUTION 2 - for very long URLs

You want to cache the URL contents and just need to have unique filenames. I would use this function:

$file_name = md5( strtolower( $url ) );

This creates a filename with a fixed length. The MD5 hash is in most cases unique enough for this kind of usage.

Example:

URL:  https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
File: 51301f3edb513f6543779c3a5433b01c
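If MD5 collisions are a concern at very large scale, a longer hash can be used in the same way. The sketch below swaps in SHA-256 via PHP's `hash()` function; `cache_filename()` is a hypothetical helper name, not something from this answer.

```php
<?php
// Hypothetical helper: same idea as the md5() one-liner, but with SHA-256.
// The result is always 64 lowercase hexadecimal characters.
function cache_filename(string $url): string
{
    return hash('sha256', strtolower($url));
}
```

As with the MD5 version, lowercasing first means that two URLs which differ only in case map to the same file.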
Philipp
  • 4
    Maybe MD5 could be a problem: "Be careful when using hashes with URLs. While the square root of the number (http://www.skrenta.com/2007/08/md5_tutorial.html) of URLs is still a lot bigger than the current web size, if you do get a collision you are going to get pages about Britney Spears when you were expecting pages about Bugzilla. It's probably a non-issue in our case, but for billions of pages I would opt for a much larger hashing algorithm such as SHA-256 or avoid it altogether." Source: https://boyter.org/2013/01/code-for-a-search-engine-in-php-part-1/ – adilbo Jul 11 '18 at 12:58
43

What about using rawurlencode()? http://www.php.net/manual/en/function.rawurlencode.php

Here is a function that sanitizes even Chinese characters:

public static function normalizeString ($str = '')
{
    $str = strip_tags($str); 
    $str = preg_replace('/[\r\n\t ]+/', ' ', $str);
    $str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
    $str = strtolower($str);
    $str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
    $str = htmlentities($str, ENT_QUOTES, "utf-8");
    $str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
    $str = str_replace(' ', '-', $str);
    $str = rawurlencode($str);
    $str = str_replace('%', '-', $str);
    return $str;
}

Here is the explanation:

  1. Strip HTML tags
  2. Collapse line breaks, tabs, carriage returns and spaces into a single space
  3. Replace characters that are illegal in folder and file names
  4. Put the string in lower case
  5. Remove foreign accents such as Éàû by converting them into HTML entities, then removing the entity code and keeping the letter.
  6. Replace spaces with dashes
  7. Encode special characters that could pass the previous steps and conflict with filenames on the server, e.g. "中文百强网"
  8. Replace "%" with dashes to make sure the link to the file will not be rewritten by the browser when querying the file.

OK, some filenames will not be meaningful, but in most cases it will work.

ex. Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"

Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"

It's better like that than a 404 error.

Hope that was helpful.

Carl.

SequenceDigitale.com
  • 1
    You are not removing NULL and Control characters. ASCII of 0 to 32 should all be removed from the string. – Basil Musa Dec 21 '15 at 23:00
  • UTF-8 is allowed in the file system and it is allowed in URLs, so why should it produce an 404 error? The only thing you need to do is to encode the URL `http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg` to `http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg` in the HTML source code as you hopefully do with all your URLs. – mgutt Feb 06 '17 at 08:27
  • 1
    Some other points: You remove HTML tags through `strip_tags()` and after that you remove `[<>]`. By that `strip_tags()` is not really needed at all. The same point are the quotes. There are no quotes left when you decode with `ENT_QUOTES`. And the `str_replace()` does not remove consecutive white spaces and then you use `strtolower()` for mult-byte string. And why do you convert to lowercase at all? And finally you did not catch any reserved character as @BasilMusa mentioned. More details in my answer: http://stackoverflow.com/a/42058764/318765 – mgutt Feb 06 '17 at 08:49
  • Why bother creating capture groups that you never use in the replacement? Why not replace `[\r\n\t ]` with `\s`? There is waaaay too much unnecessary escaping in `[\"\*\/\:\<\>\?\'\|]`. – mickmackusa Apr 06 '23 at 01:22
42

Instead of worrying about overlooking characters - how about using a whitelist of characters you are happy to be used? For example, you could allow just good ol' a-z, 0-9, _, and a single instance of a period (.). That's obviously more limiting than most filesystems, but should keep you safe.
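A minimal sketch of such a whitelist, assuming lowercase a-z, 0-9, underscore and at most one period (kept before the extension). The function name `whitelist_filename()` and the choice to lowercase the input first are assumptions made for illustration only.

```php
<?php
// Hypothetical helper illustrating the whitelist idea described above.
function whitelist_filename(string $name): string
{
    $name = strtolower($name);
    // Drop every byte that is not a-z, 0-9, underscore or period.
    $name = preg_replace('/[^a-z0-9_.]+/', '', $name);
    // No leading or trailing periods (rules out ".", ".." and hidden files).
    $name = trim($name, '.');
    // Keep only the last period, i.e. the one in front of the extension.
    $lastDot = strrpos($name, '.');
    if ($lastDot !== false) {
        $name = str_replace('.', '', substr($name, 0, $lastDot)) . substr($name, $lastDot);
    }
    return $name;
}
```

With this sketch, `My..Report v2.PDF` becomes `myreportv2.pdf`. Non-ASCII letters are simply dropped, so `Québec.txt` becomes `qubec.txt`; the result can also be empty, which the caller has to handle.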

Dominic Rodger
  • 46
    No good for languages with Umlauts. This would result in Qubec for Québec, Dsseldorf for Düsseldorf, and so on. – Pekka Jan 07 '10 at 17:11
  • 17
    True - but like I said: "For example". – Dominic Rodger Jan 07 '10 at 17:13
  • 5
    Which may be perfectly acceptable to the OP. Otherwise, use something like http://php.net/manual/en/class.normalizer.php – Blair McMillan Jan 07 '10 at 17:23
  • 4
    That is actually not what was asked. The op asks for a function to sanitize string, not a alternative. – i.am.michiel Feb 07 '13 at 10:14
  • 4
    @i.am.michiel, perhaps, but given the OP accepted it, I'll assume they found it helpful. – Dominic Rodger Feb 07 '13 at 12:48
  • 1
    For Umlauts you can always include the following snippet: $string = strtr( $string, "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðòóôõöøùúûüýÿÑñ", "AAAAAACEEEEIIIIOOOOOOUUUUYaaaaaaceeeeiiiiooooooouuuuyyNn" ); – Ronald Hulshof Mar 01 '14 at 17:35
  • 1
    Not an answer to the question, should be a comment. – Hayley Apr 21 '14 at 21:45
  • Thanks @asdasd, but as I said, the OP accepting it makes me think they found it helpful. – Dominic Rodger Apr 22 '14 at 09:15
  • 1
    @RonaldHulshof: Your snippet does not account for multibyte characters. For that you'd have to create a transformation array with key = umlaut, value = regular char and pass it as second parameter to `strtr()`. Alternatively, use `iconv('UTF-8','ASCII//TRANSLIT',$string);` – Sven Nov 16 '15 at 20:05
  • 1
    Will not work with other alphabets, like `Файл.docx` – Slava Mar 20 '16 at 16:13
  • @BlairMcMillan how would Normalizer help? None of the types of Unicode normalizations seem to have anything to do with guaranteeing the fitness of a string as filename for a particular type filesystem. – matteo Jan 08 '17 at 17:22
  • Is there a regex string for this? – Aaron Esau Mar 03 '18 at 01:50
  • 1
    Can you write an example and post it? – TekOps Jul 08 '20 at 04:47
  • please give the code to it – Matoeil Dec 17 '21 at 11:15
22

Well, tempnam() will do it for you.

http://us2.php.net/manual/en/function.tempnam.php

but that creates an entirely new name.

To sanitize an existing string just restrict what your users can enter and make it letters, numbers, period, hyphen and underscore then sanitize with a simple regex. Check what characters need to be escaped or you could get false positives.

$sanitized = preg_replace('/[^a-zA-Z0-9\-\._]/','', $filename);
Austin
Mark Moline
15

Safe: replace every run of characters outside "a-zA-Z0-9_-" with a dash, and add the extension yourself.

$name = preg_replace('/[^a-zA-Z0-9_-]+/', '-', strtolower($name)).'.'.$extension;

so a PDF called

"This is a grüte test_service +/-30 thing"

becomes

"this-is-a-gr-te-test_service--30-thing.pdf"
commonpike
  • 1
    You need to add the file extension separated by a ".": $name = preg_replace('/[^a-zA-Z0-9_-]+/', '-', strtolower($name)).'.'.$extension; – Edmunds22 Aug 02 '20 at 15:02
  • What was the matter with `[^\w-]+`? If you are going to unconditional call `strtolower()` on the input, what is the point of including `[A-Z]` in your character class? Should you not use `mb_strtolower()` and add the `u` pattern modifier to ensure that the input text is always parsed as individual bytes? I don't know how those multibyte-unsafe techniques might split (any) multibyte characters -- might it produce an unintended valid character? – mickmackusa Apr 06 '23 at 00:29
14
preg_replace("/[^\w\s\d\.\-_~,;:\[\]\(\)]/", '', $file)

Add/remove more valid characters depending on what is allowed for your system.

Alternatively you can try to create the file and then return an error if it's bad.

Tor Valamo
11

PHP provides a function to sanitize text into different formats:

filter_var() with FILTER_SANITIZE_URL as the second parameter

How to use:

echo filter_var(
   "Lorem Ipsum has been the industry's", FILTER_SANITIZE_URL
); 

Sample output:

LoremIpsumhasbeentheindustry's

Sᴀᴍ Onᴇᴌᴀ
120DEV
  • 3
    Good, but it would not remove slashes, which could be a problem: Directory traversing. – func0der Jun 11 '19 at 10:10
  • On Windows, the list of illegal, common characters for file names is `\ / : * ? " < > |`. EVERY one of those is allowed by the `FILTER_SANITIZE_URL` rule. – thelr May 19 '21 at 13:09
  • As variant - `FILTER_SANITIZE_EMAIL`. Remove all characters except letters, digits and `!#$%&'*+-=?^_\`{|}~@.[]`. – dobs Apr 30 '22 at 10:25
7

Making a small adjustment to Sean Vieira's solution to allow for single dots, you could use:

preg_replace("([^\w\s\d\.\-_~,;:\[\]\(\)]|[\.]{2,})", '', $file)
CarlJohnson
  • Literal dots inside of a character class do not benefit from an escaping backslash. I do not recommend using `(` and `)` as pattern delimiters because it can confuse readers who are new to regex -- they may assume it is a capture group and that there are no delimiters at all. – mickmackusa Apr 06 '23 at 01:26
6

The following expression creates a nice, clean, and usable string:

/[^a-z0-9\._-]+/gi

Turning today's financial: billing into today-s-financial-billing
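In PHP the expression can be applied roughly as sketched below. Two things here are assumptions rather than part of the answer: the replacement string `'-'` is inferred from the example output, and the `g` flag is dropped because PHP's PCRE functions have no such modifier (`preg_replace()` already replaces every match).

```php
<?php
// The "i" modifier keeps the match case-insensitive; no "g" is needed in PHP.
$clean = preg_replace('/[^a-z0-9._-]+/i', '-', "today's financial: billing");

echo $clean, "\n"; // today-s-financial-billing
```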

Sampson
  • so a filename can't have a period or an underscore, or anything like that? – Tor Valamo Jan 07 '10 at 16:02
  • 2
    @Jonathan - what's with the italics? – Dominic Rodger Jan 07 '10 at 16:04
  • @Tor, yes, sorry. Updated. @Dominic, just drawing emphasis on the text. – Sampson Jan 07 '10 at 16:05
  • What is gism? I get " Warning: preg_replace() [function.preg-replace]: Unknown modifier 'g' " – user151841 Jan 07 '10 at 16:28
  • `g` - global, `i` - insensitive case, `s` - dotall, `m` - multiline. In this example, you could do without `s` and `m`. – Sampson Jan 07 '10 at 16:57
  • 1
    @user151841 For `preg_replace` the global flag is implicit. So there is no need for g if preg_replace is being used. When we want to control the number of replacements preg_replace has a `limit` parameter for that. Read the preg_replace documentation for more. – rineez Aug 02 '14 at 09:00
2

These may be a bit heavy, but they're flexible enough to sanitize whatever string into a "safe" en style filename or folder name (or heck, even scrubbed slugs and things if you bend it).

1) Building a full filename (with fallback name in case input is totally truncated):

str_file($raw_string, $word_separator, $file_extension, $fallback_name, $length);

2) Or using just the filter util without building a full filename (strict mode true will not allow [] or () in filename):

str_file_filter($string, $separator, $strict, $length);

3) And here are those functions:

// Returns filesystem-safe string after cleaning, filtering, and trimming input
function str_file_filter(
    $str,
    $sep = '_',
    $strict = false,
    $trim = 248) {

    $str = strip_tags(htmlspecialchars_decode(strtolower($str))); // lowercase -> decode -> strip tags
    $str = str_replace("%20", ' ', $str); // convert rogue %20s into spaces
    $str = preg_replace("/%[a-z0-9]{1,2}/i", '', $str); // remove hexy things
    $str = str_replace("&nbsp;", ' ', $str); // convert all nbsp into space
    $str = preg_replace("/&#?[a-z0-9]{2,8};/i", '', $str); // remove the other non-tag things
    $str = preg_replace("/\s+/", $sep, $str); // filter multiple spaces
    $str = preg_replace("/\.+/", '.', $str); // filter multiple periods
    $str = preg_replace("/^\.+/", '', $str); // trim leading period

    if ($strict) {
        $str = preg_replace("/([^\w\d\\" . $sep . ".])/", '', $str); // only allow words and digits
    } else {
        $str = preg_replace("/([^\w\d\\" . $sep . "\[\]\(\).])/", '', $str); // allow words, digits, [], and ()
    }

    $str = preg_replace("/\\" . $sep . "+/", $sep, $str); // filter multiple separators
    $str = substr($str, 0, $trim); // trim filename to desired length, note 255 char limit on windows

    return $str;
}


// Returns full file name including fallback and extension
function str_file(
    $str,
    $sep = '_',
    $ext = '',
    $default = '',
    $trim = 248) {

    // Run $str and/or $ext through filters to clean up strings
    $str = str_file_filter($str, $sep);
    $ext = '.' . str_file_filter($ext, '', true);

    // Default file name in case all chars are trimmed from $str, then ensure there is an id at tail
    if (empty($str) && empty($default)) {
        $str = 'no_name__' . date('Y-m-d_H-i_A') . '__' . uniqid(); // "i" is minutes; "m" would repeat the month
    } elseif (empty($str)) {
        $str = $default;
    }

    // Return completed string
    if (!empty($ext)) {
        return $str . $ext;
    } else {
        return $str;
    }
}

So let's say some user input is: .....&lt;div&gt;&lt;/div&gt;<script></script>&amp; Weiß Göbel 中文百强网File name %20 %20 %21 %2C Décor \/. /. . z \... y \...... x ./ “This name” is & 462^^ not &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული

And we wanna convert it to something friendlier to make a tar.gz with a file name length of 255 chars. Here is an example use. Note: this example includes a malformed tar.gz extension as a proof of concept, you should still filter the ext after string is built against your whitelist(s).

$raw_str = '.....&lt;div&gt;&lt;/div&gt;<script></script>&amp; Weiß Göbel 中文百强网File name  %20   %20 %21 %2C Décor  \/.  /. .  z \... y \...... x ./  “This name” is & 462^^ not &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული';
$fallback_str = 'generated_' . date('Y-m-d_H-i_A');
$bad_extension = '....t&+++a()r.gz[]';

echo str_file($raw_str, '_', $bad_extension, $fallback_str);

The output would be: _wei_gbel_file_name_dcor_._._._z_._y_._x_._this_name_is_462_not_that_grrrreat_][09]()1234747)_.tar.gz

You can play with it here: https://3v4l.org/iSgi8

Or a Gist: https://gist.github.com/dhaupin/b109d3a8464239b7754a

EDIT: updated script filter for &nbsp; instead of space, updated 3v4l link

dhaupin
2

Use this to accept just words (unicode support such as utf-8) and "." and "-" and "_" in string :

$sanitized = preg_replace('/[^\w\-\._]/u','', $filename);
sj59
  • Underscore is included in `\w`. Inside of a character class, a `.` doesn't need to be escaped. To make longer matches and fewer replacements, use the `+` quantifier. – mickmackusa Apr 06 '23 at 00:31
1

The best I know today is the static method Strings::webalize from the Nette framework.

BTW, this translates all diacritics to their base characters: š=>s, ü=>u, ß=>ss, etc.

For filenames you have to add the dot "." to the allowed-characters parameter.

/**
 * Converts to ASCII.
 * @param  string  UTF-8 encoding
 * @return string  ASCII
 */
public static function toAscii($s)
{
    static $transliterator = NULL;
    if ($transliterator === NULL && class_exists('Transliterator', FALSE)) {
        $transliterator = \Transliterator::create('Any-Latin; Latin-ASCII');
    }

    $s = preg_replace('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{2FF}\x{370}-\x{10FFFF}]#u', '', $s);
    $s = strtr($s, '`\'"^~?', "\x01\x02\x03\x04\x05\x06");
    $s = str_replace(
        array("\xE2\x80\x9E", "\xE2\x80\x9C", "\xE2\x80\x9D", "\xE2\x80\x9A", "\xE2\x80\x98", "\xE2\x80\x99", "\xC2\xB0"),
        array("\x03", "\x03", "\x03", "\x02", "\x02", "\x02", "\x04"), $s
    );
    if ($transliterator !== NULL) {
        $s = $transliterator->transliterate($s);
    }
    if (ICONV_IMPL === 'glibc') {
        $s = str_replace(
            array("\xC2\xBB", "\xC2\xAB", "\xE2\x80\xA6", "\xE2\x84\xA2", "\xC2\xA9", "\xC2\xAE"),
            array('>>', '<<', '...', 'TM', '(c)', '(R)'), $s
        );
        $s = @iconv('UTF-8', 'WINDOWS-1250//TRANSLIT//IGNORE', $s); // intentionally @
        $s = strtr($s, "\xa5\xa3\xbc\x8c\xa7\x8a\xaa\x8d\x8f\x8e\xaf\xb9\xb3\xbe\x9c\x9a\xba\x9d\x9f\x9e"
            . "\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3"
            . "\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8"
            . "\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe"
            . "\x96\xa0\x8b\x97\x9b\xa6\xad\xb7",
            'ALLSSSSTZZZallssstzzzRAAAALCCCEEEEIIDDNNOOOOxRUUUUYTsraaaalccceeeeiiddnnooooruuuuyt- <->|-.');
        $s = preg_replace('#[^\x00-\x7F]++#', '', $s);
    } else {
        $s = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s); // intentionally @
    }
    $s = str_replace(array('`', "'", '"', '^', '~', '?'), '', $s);
    return strtr($s, "\x01\x02\x03\x04\x05\x06", '`\'"^~?');
}


/**
 * Converts to web safe characters [a-z0-9-] text.
 * @param  string  UTF-8 encoding
 * @param  string  allowed characters
 * @param  bool
 * @return string
 */
public static function webalize($s, $charlist = NULL, $lower = TRUE)
{
    $s = self::toAscii($s);
    if ($lower) {
        $s = strtolower($s);
    }
    $s = preg_replace('#[^a-z0-9' . preg_quote($charlist, '#') . ']+#i', '-', $s);
    $s = trim($s, '-');
    return $s;
}
DnD
  • Why do you want to replace diacritics? Simply use `urlencode()` before you use the filename as a `src` or `href`. The only currently used file system that has problems with UTF-8 is FATx (used by XBOX): https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits And I do not think this is used by web servers – mgutt Feb 06 '17 at 09:05
1

It seems this all hinges on the question: is it possible to create a filename that can be used to hack into a server (or do some other such damage)? If not, then it seems the simple answer is to try creating the file wherever it will ultimately be used (since that will be the operating system of choice, no doubt). Let the operating system sort it out. If it complains, report that complaint back to the user as a validation error.

This has the added benefit of being reliably portable, since all (I'm pretty sure) operating systems will complain if the filename is not properly formed for that OS.

If it is possible to do nefarious things with a filename, perhaps there are measures that can be applied before testing the filename on the resident operating system -- measures less complicated than a full "sanitation" of the filename.
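A minimal sketch of the "let the operating system decide" idea. Everything in it is an assumption made for illustration: the helper name `try_create_file()`, the use of `basename()` as the lighter-weight pre-check against directory traversal, and the choice to return `null` on failure.

```php
<?php
// Hypothetical helper: tries to create an empty file with the user's name
// inside $dir. Returns the created path, or null if the name was rejected.
function try_create_file(string $dir, string $userName): ?string
{
    // NUL bytes are never valid in a path; newer PHP versions throw on them.
    if (strpos($userName, "\0") !== false) {
        return null;
    }
    // Treat both separator styles alike and keep only the last path component.
    $name = basename(str_replace('\\', '/', $userName));
    if ($name === '' || $name === '.' || $name === '..') {
        return null;
    }
    $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
    // Mode "x" fails if the file already exists; @ hides the warning so the
    // caller can show its own validation message instead.
    $handle = @fopen($path, 'x');
    if ($handle === false) {
        return null; // the OS refused the name, or the file already exists
    }
    fclose($handle);
    return $path;
}
```

The caller would turn a `null` result into a validation error for the user. Note that a name the OS accepts can still be unsafe in other contexts, such as HTML output.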

ReverseEMF
1
function sanitize_file_name($file_name) {
  // case of multiple dots: split at the last dot only
  $dot = strrpos($file_name, '.');
  // a name without any dot has no extension
  $file_name_without_ext = $dot === false ? $file_name : substr($file_name, 0, $dot);
  $extension = $dot === false ? '' : substr($file_name, $dot + 1);
  // replace special characters (preg_quote() is not needed here and would only add backslashes)
  $file_name_without_ext = preg_replace('/[^a-zA-Z0-9_]/', '_', $file_name_without_ext);
  $extension = preg_replace('/[^a-zA-Z0-9]/', '', $extension);
  // append the dot only when there is an extension
  return $extension === '' ? $file_name_without_ext : $file_name_without_ext . '.' . $extension;
}
Matoeil
0

/ and .. in the user provided file name can be harmful. So you should get rid of these by something like:

$fname = str_replace('..', '', $fname);
$fname = str_replace('/',  '', $fname);
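The order of the two replacements matters: with the lines above, `./.` passes the first step unchanged and is then turned into `..` by the second. A sketch that avoids this by removing separators first and trimming leading dots afterwards is shown below; `strip_path_parts()` is a hypothetical name, and handling the backslash as well is an added assumption for Windows.

```php
<?php
// Hypothetical helper: removes path separators, then leading dots.
function strip_path_parts(string $fname): string
{
    // Remove both separator styles first, so no later step can form a new "..".
    $fname = str_replace(['/', '\\'], '', $fname);
    // Dropping leading dots rules out ".", ".." and hidden files like ".htaccess".
    return ltrim($fname, '.');
}
```

Dots inside the name are left alone, so `file..name.txt` stays as it is. The result may be empty and still needs the other checks discussed in this thread (reserved characters, length).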
Synetech
gameover
  • This is insufficient! For example, the filename "./.name" will still break out of the current directory. (Removing .. does nothing here, but removing / will turn the ./. into .. and hence break out of the target directory.) – Colin Emonds Jun 08 '15 at 12:09
  • 3
    @cemper93 No, this answer will just turn the string into `..name` which would not break out of anything. Removing all path separator characters should be sufficient to prevent any directory traversal. (The removal of `..` is technically unnecessary.) – cdhowie Jun 15 '15 at 16:44
  • @cdhowie Yes, but the filename `./.` becomes `..`. And finally this answer misses all other file system reserved characters like NULL. More in my answer: http://stackoverflow.com/a/42058764/318765 – mgutt Feb 06 '17 at 08:59
0

one way

$bad='/[\/\\\\:*?"<>|]/'; // the backslash is also illegal in Windows filenames
$string = 'fi?le*';

function sanitize($str,$pat)
{
    return preg_replace($pat,"",$str);

}
echo sanitize($string,$bad);
ghostdog74
  • 1
    What about non-printable characters? It's better to use the white list approach than black list approach in this case. Basically allow only the printable ASCII file names excluding the special letters of course. But for non-english locales, that's another problem. – TheRealChx101 Oct 24 '18 at 03:06
-4

$fname = str_replace('/','',$fname);

Since users might use the slash to separate two words, it would be better to replace it with a dash instead of an empty string.

user2246924