Open, close and copy of unicode files on Windows

Question

I get the following error when I open a file in another program, close a file in another program, or perform system operations such as copying a file, whenever the file name contains Unicode characters. My current code works fine on a Macintosh but not on Windows. I have only just started working with Unicode file names and the CLI.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0301' in position 5: ordinal not in range(128)

A simplified example that closes a file in another application looks like this:

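import subprocess

def CloseFile( fileToClose ):
    # fmsadmin is assumed to hold the path to the fmsadmin executable (defined elsewhere).
    cmd = [ 'sudo', fmsadmin, 'close', fileToClose, '-u', 'userName', '-p', 'accountName', '-y' ]
    subprocess.check_output( cmd )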

CloseFile( u'ÉürøFile.fmp12' )

I tried decoding the file name before building the cmd list, but that does not work:

fileToClose = fileToClose.decode('utf-8')

I can give you a CopyFile() example if you want, but it errors out well before the command is called, so you shouldn't need FileMaker Server installed to reproduce the issue.

I'm using shutil.copy( from, to ) for the copy method.

If the `fileToClose` variable is of type `unicode`, you definitely need to `encode()`, not `decode()`. And I'm not sure if Windows uses UTF-8 for file names. — lenz, Oct 16 '18 at 20:23
@lenz Yes, that is what I'm trying to figure out. I have also tried `fileToClose = fileToClose.encode(sys.stdout.encoding)` but end up with the same error, raised at the encode line rather than the subprocess line. — Keith, Oct 16 '18 at 21:05
AFAIK, the encoding of the standard streams is not related to the file system's encoding of file names, nor to the encoding used by the system calls to which `subprocess` delegates. Have you seen [this thread](https://stackoverflow.com/q/1910275)? — lenz, Oct 17 '18 at 07:25
Yes, I did. There were no suggestions on how to install the suggested patch. I tried installing it anyway, and I also moved all my code over to Popen, but that didn't solve the issue. I was guessing at how to apply the patch, so I'm not sure it took. The steps I took are documented on the patch page. — Keith, Oct 18 '18 at 17:31
If the first answer in that thread is right, switching to Python 3 might help. Is that not an option? You'll have to move to Py3 sooner or later anyway. — lenz, Oct 18 '18 at 19:31
Unfortunately I cannot move to Python 3 (our automation system uses 2.7). Python 3 probably would not require encoding and decoding. It would just work since, I believe, everything is Unicode from end to end. I'm attempting a REST-based solution right now and will know in an hour or two whether that solves the issue another way. — Keith, Oct 19 '18 at 21:06
Well, Python 3 doesn't magically solve all Unicode/encoding issues. But the linked answers suggest it works in a more controllable way. But in any case, file **name** encoding is more of a pain than file (content) encoding in general, especially in a cross-platform setting. — lenz, Oct 20 '18 at 10:16
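
To illustrate the encode/decode point from the first comment, here is a minimal sketch. The file name is a hypothetical decomposed spelling, written with escapes so the code points are unambiguous. In Python 2, calling `.decode()` on a `unicode` object first encodes it implicitly with the ASCII codec, which would explain why a decode attempt ends in a `UnicodeEncodeError`. The sketch makes that implicit step explicit so that it behaves the same on Python 2.7 and Python 3.

```python
from __future__ import print_function

# Decomposed spelling: "E" + U+0301 (combining acute), "u" + U+0308 (combining diaeresis).
name = u'E\u0301u\u0308r\xf8File.fmp12'

# Text -> bytes is encode(). UTF-8 can represent every code point in the name.
utf8_bytes = name.encode('utf-8')
print(repr(utf8_bytes))

# ASCII cannot, so this raises the same kind of error as in the question.
error = None
try:
    name.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# Bytes -> text is decode(), which round-trips the original string.
round_tripped = utf8_bytes.decode('utf-8')
print(round_tripped == name)
```

Which encoding a given external program expects for its arguments on Windows is a separate question that this sketch does not answer.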

Keith · Answer 1 · 2018-12-04T01:50:14.927

OK... I finally figured this out and will provide the warning below. I had copied and pasted the name of the file from another program.

The first two characters in this file name were not combined (ÉürøFile.fmp12), so they were encoded as E´ and u¨ instead of É and ü. Evidently Python 2.7 cannot handle characters that are not combined when running a command line built from that file name.

So the warning here is to:

Use the repr() function in Python to understand how the string is encoded.
Use tools that support the style of encoding you need.
When you compare values, be sure that both sides of the comparison use the same Unicode form (composed versus decomposed).
Finally, look up the character that the error points to. I did not, but if I had, I would have found that Unicode character U+0301 is an accent character: ´
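
The first and last points can be sketched with the standard `unicodedata` module. The helper name `describe` and the sample string are hypothetical, chosen only for illustration; the idea is to list every code point in a suspect string together with its official Unicode name, which gives the same information as reading `repr()` output and looking each escape up by hand.

```python
from __future__ import print_function
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return ['U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>'))
            for ch in text]

# The first two visible characters of the file name, in decomposed form.
suspect = u'E\u0301u\u0308'
for entry in describe(suspect):
    print(entry)
```

Two visible characters producing four entries is the sign that the string is decomposed.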

As a follow-up, I found another issue. Once the code above was corrected, I tried to protect the Unicode file names by using Python's built-in zip utilities: storing the files in a zip file and unzipping them when needed. This was evidently a mistake in Python 2.7 on Windows. When Python's built-in unzip utility unpacked a file, it messed up the encoding of the file name on disk (it works fine on Mac). The name got munged on Windows and was not recognized by system utilities such as copy, mv, getSize, etc. My workaround was to place the loose test files with European and Asian names in a folder on an SMB volume and have my code work directly on them instead. More lessons learned. I'm hoping I can move to Python 3 in the future and that it has fewer file-based issues.
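
One thing that may be worth checking, although nothing in this thread confirms it was the cause: the ZIP format marks UTF-8 file names with bit 11 (0x800) of each entry's general purpose flags, and an extractor has to guess the encoding for names stored without that flag. The sketch below reports the flag for every member of an archive using the standard `zipfile` module. The helper name `utf8_flag_report` and the in-memory test archive are hypothetical.

```python
from __future__ import print_function
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose flags

def utf8_flag_report(zip_source):
    """Map each member name to True if its UTF-8 file-name flag is set."""
    with zipfile.ZipFile(zip_source) as archive:
        return dict((info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
                    for info in archive.infolist())

# Build a small in-memory archive with one non-ASCII and one ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr(u'\xc9\xfcr\xf8File.fmp12', b'data')
    archive.writestr(u'plain.txt', b'data')
buf.seek(0)

report = utf8_flag_report(buf)
for member, flagged in report.items():
    print(repr(member), flagged)
```

Running the same helper against an archive produced by another tool would show whether its non-ASCII names carry the flag.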

Looks like these characters are in _decomposed_ form (Latin character followed by combining accent character, so repr would be u'E\u0301u\u0308'). You may be able to change them to _composed_ form (single characters like u'\xc9\xfc') by doing `unicodedata.normalize('NFC', string)`. Don't know if that will make any difference though. — snakecharmerb, Mar 02 '19 at 19:47
Thanks, I finally figured this out. My code is actually using single characters, BUT I had compressed these files using the Mac's compression utility, which compresses and decompresses them correctly. Unfortunately Python's expansion on Windows does not correctly name the file on disk after decompression. I have worked around the issue by not using file compression for these types of files. Maybe Python 3.x handles this better, but I cannot upgrade from 2.7 yet. — Keith, Mar 03 '19 at 05:14
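
The normalization suggested in the comment above can be sketched as follows. The two spellings of the file name are written with escapes so the difference is visible, and the helper `same_name` is a hypothetical name used only for illustration. Normalizing both sides to the same form before comparing is one way to avoid a composed versus decomposed mismatch.

```python
from __future__ import print_function
import unicodedata

decomposed = u'E\u0301u\u0308r\xf8File.fmp12'  # base letters + combining accents
composed = u'\xc9\xfcr\xf8File.fmp12'          # single precomposed characters

def same_name(a, b, form='NFC'):
    """Compare two names after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(decomposed == composed)            # raw comparison of code points
print(same_name(decomposed, composed))   # comparison after NFC normalization
print(len(decomposed), len(composed))    # the decomposed form is longer
```

Whether the composed form is what a particular file system or external tool expects still has to be checked case by case.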
