I am trying to write a small script in python 3 to sanitise filenames before they are uploaded to a cloud solution. This needs to run the same on unix and windows systems (including macs). Linux and mac allow characters in file and directory names that windows does not, and for this reason files with these characters simply cannot be uploaded, which is why the script is required.
I am utilising os.walk()
to scan through the files and directories, but while the regex for my first check ('[\\\\/":<>|*?]'
) runs without issues on my linux test, it does not work when actually run from windows.
Given for example a file named hello?\This is a file
, python will read it as 'hello\uf03f\uf05cThis is a file'
and the regex will of course not match. I have tried converting it to bytes then decoding it, encoding and decoding it and using a byte string as the path and decoding all as suggested by various semi-related SO posts, but nothing seems to give me the original characters.
Is anyone able to suggest anything I can do besides adding the sequences to the regex, which would be my last resort if I can't find the real solution?
Example of what I am testing with (invalid files created by mounting drive to linux):
- C:\Users\username\Desktop:
- shortcut.lnk
- text file.txt
- |\invalid??.txt
for dirpath, dirnames, filenames in os.walk("C:\\Users\\username\\Desktop"):
for file in filenames:
print(file)
Outputs:
'shortcut.lnk'
'text file.txt'
'\uf07c\uf05cinvalid\uf03f\uf03f.txt'