I am trying to write a small script in Python 3 to sanitise filenames before they are uploaded to a cloud solution. It needs to behave the same on Unix and Windows systems (including Macs). Linux and macOS allow characters in file and directory names that Windows does not, and files with these characters simply cannot be uploaded, which is why the script is required.

I am using os.walk() to scan through the files and directories, but while the regex for my first check ('[\\\\/":<>|*?]') matches without issues in my Linux test, it does not match when actually run on Windows.

Given, for example, a file named hello?\This is a file, Python reads it as 'hello\uf03f\uf05cThis is a file', so the regex of course does not match. I have tried converting it to bytes and then decoding it, encoding and decoding it, and using a byte string as the path and decoding everything, as suggested by various semi-related SO posts, but nothing gives me the original characters.

Is anyone able to suggest anything I can do besides adding the sequences to the regex, which would be my last resort if I can't find the real solution?

Example of what I am testing with (the invalid files were created by mounting the drive in Linux):

  • C:\Users\username\Desktop:
    • shortcut.lnk
    • text file.txt
    • |\invalid??.txt

    import os

    for dirpath, dirnames, filenames in os.walk("C:\\Users\\username\\Desktop"):
        for file in filenames:
            print(repr(file))  # repr() shows the escaped code points

Outputs:


    'shortcut.lnk'
    'text file.txt'
    '\uf07c\uf05cinvalid\uf03f\uf03f.txt'
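
As the comments below explain, the names in this output look like reserved ASCII characters shifted into the Private Use Area range U+F000 - U+F0FF. The following is a minimal sketch of mapping them back before running the regex check. It assumes each escaped character is exactly U+F000 plus the original ASCII code point, which matches the output above but has not been verified for every tool that writes such names. The names `unescape_pua` and `RESERVED` are hypothetical.

```python
import re

# Assumption: an escaped character is U+F000 + ord(original ASCII character).
PUA_START, PUA_END = 0xF000, 0xF0FF

# The regex from the question: characters Windows reserves in filenames.
RESERVED = re.compile(r'[\\/":<>|*?]')


def unescape_pua(name: str) -> str:
    """Map characters in U+F000 - U+F0FF back to the code point minus 0xF000."""
    return "".join(
        chr(ord(ch) - PUA_START) if PUA_START <= ord(ch) <= PUA_END else ch
        for ch in name
    )


name = "\uf07c\uf05cinvalid\uf03f\uf03f.txt"
decoded = unescape_pua(name)
print(decoded)                          # |\invalid??.txt
print(bool(RESERVED.search(decoded)))   # True
```

Names that contain no characters in that range pass through unchanged, so the same check can run on both Linux and Windows.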

martineau
Ben
  • Could you just remove `utf-8 literals` from every filename via `s.encode('ascii',errors='ignore').decode('ascii')`? – PacketLoss Oct 12 '20 at 23:21
  • While this isn't directly related to the current issue, you might find it useful: https://stackoverflow.com/q/7130885 – AMC Oct 12 '20 at 23:30
  • You also need to exclude ASCII control characters 1-31. Also, sanitizing names for Windows needs to reserve DOS device names: CON, CONIN$, CONOUT$, NUL, PRN, AUX, LPT1-9, and COM1-9 -- including the base name followed by 0 or more spaces and any dot extension (e.g. `"nul.txt"` and `"con .some_extension"`). I rarely see CONIN$ and CONOUT$ handled or spaces following the base device name correctly ignored, so many systems are vulnerable. – Eryk Sun Oct 13 '20 at 06:07
  • The way that you're accessing the NTFS volume in Linux is configured to automatically translate reserved ASCII characters into the Private Use Area U+F000 - U+F0FF, which are valid filename characters in Windows. An example system that does this is the drvfs filesystem in WSL (e.g. creating a file in "/mnt/c"). When accessed from Linux, these PUA characters are translated back to ASCII. – Eryk Sun Oct 13 '20 at 18:48
  • @ErykSun That makes a lot of sense, thank you. A lot of the mechanisms surrounding encodings are a bit mysterious to me. Would you happen to know if there is any situation where this automatic translation would _not_ occur? So far, in every situation I've come up with, something similar (but not the same) happens. Just wondering if it's something I need to properly account for if the upload took place from Windows, or whether I can count on this process to take care of it for us. – Ben Oct 13 '20 at 22:22
  • Also, I agree regarding the control characters; they are on the list of requirements set out by the provider. It's a mix of them matching Windows restrictions to keep their solution consistent across Windows and Macs, along with disregarding common meta and temp files from both. This was just my jumping-off point. – Ben Oct 13 '20 at 22:30
  • Cygwin and WSL both use PUA escaping for reserved characters in filenames stored on disk. I don't think the Linux ntfs-3g driver does; its `windows_names` option just denies creating files with reserved names. But anyway, I don't understand why you need actual names in the filesystem in order to test sanitizing names. – Eryk Sun Oct 14 '20 at 06:41
  • Your sanitizing scheme also needs to reserve names with trailing spaces and dots (e.g. "spam ", "spam.", "spam. . .") because the Windows API normally strips them before passing the filename to the kernel. – Eryk Sun Oct 14 '20 at 06:41
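
The extra rules raised in the comments above (control characters, DOS device names with optional trailing spaces and any dot extension, and trailing spaces or dots) could be sketched roughly as follows. This is an illustration of those stated rules only, not a complete Windows name validator, and the names `INVALID_CHARS`, `DEVICE_NAME`, and `find_problems` are hypothetical.

```python
import re

# Reserved punctuation from the question plus ASCII control characters.
# Assumption: NUL (0) is included alongside 1-31 for simplicity.
INVALID_CHARS = re.compile(r'[\\/":<>|*?\x00-\x1f]')

# Device base name, then zero or more spaces, then an optional dot extension.
DEVICE_NAME = re.compile(
    r'(CON|CONIN\$|CONOUT\$|NUL|PRN|AUX|LPT[1-9]|COM[1-9]) *(\..*)?',
    re.IGNORECASE,
)


def find_problems(name: str) -> list:
    """Return a list of reasons why `name` would need sanitising."""
    problems = []
    if INVALID_CHARS.search(name):
        problems.append("reserved or control character")
    if DEVICE_NAME.fullmatch(name):
        problems.append("DOS device name")
    if name.endswith((" ", ".")):
        problems.append("trailing space or dot")
    return problems


for candidate in ["text file.txt", "nul.txt", "con .some_extension", "spam. . ."]:
    print(repr(candidate), find_problems(candidate))
```

Because the checks operate on plain strings, they can be tested against a list of sample names without creating any such files on disk.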
