I'm trying to upload files containing special characters on our platform via the exec command but the characters are always interpreted and it fails.

For example if I try to upload a mémo.txt file I get the following error:

/bin/cp: cannot create regular file `/path/to/dir/m\351mo.txt': No such file or directory

UTF-8 is correctly configured on the system, and if I run the command in the shell it works fine.

Here is the Tcl code:

exec /bin/cp $tmp_filename $dest_path

How can I make it work?

Simon

2 Answers

The core of the problem is which encoding is used to communicate with the operating system. For exec and filenames, that encoding is whatever the encoding system command returns (Tcl makes a pretty good guess at the correct value when the Tcl library starts up, but very occasionally gets it wrong). On my computer, that command returns utf-8, which says (correctly!) that strings passed to (and received from) the OS are UTF-8.
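A quick way to see what Tcl has guessed in your own process is sketched below. The output depends on your locale and on how your Tcl was built, so none is shown here.

```tcl
# Report the encoding Tcl uses for filenames, exec arguments and other OS strings.
puts "system encoding: [encoding system]"

# Check that the two encodings discussed in this answer are known to this build.
foreach name {utf-8 iso8859-1} {
    puts "$name available: [expr {$name in [encoding names]}]"
}
```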

You should be able to use the file copy command instead of doing exec /bin/cp, which will be helpful here as it has fewer layers of trickiness (it avoids going through an external program, which can impose its own problems). We'll assume that's being done:

set tmp_filename "foobar.txt";  # <<< fill in the right value, of course
set dest_path "/path/to/dir/mémo.txt"
file copy $tmp_filename $dest_path

If that fails, we need to work out why. The most likely problems relate to the encoding though, and can go wrong in multiple ways that interact horribly. Alas, the details matter. In particular, the encoding for a path depends on the actual filesystem (it's formally a parameter when the filesystem is created) and can vary on Unix between parts of a path when you have a mount within another mount.

If the worst comes to the worst, you can put Tcl into ISO 8859-1 mode and then do all the encoding yourself (as ISO 8859-1 is the “just use the bytes I tell you” encoding); encoding convertto is also useful in this case. Be aware that this can generate filenames that cause trouble for other programs, but it at least lets you get at the file.

encoding system iso8859-1
file copy $tmp_filename [encoding convertto utf-8 $dest_path]

Care might be needed to convert different parts of the path correctly in this case: you're taking full responsibility for what's going on.
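As a rough sketch of what converting the parts separately could look like, the hypothetical helper below (encode_components is not a built-in command, and the encodings in the usage line are made up for illustration) encodes each path component with its own encoding and joins the resulting bytes. It assumes encoding system has already been set to iso8859-1 so that the bytes reach the OS unchanged.

```tcl
# Hypothetical helper: takes a flat list of encoding/component pairs and
# returns a path whose components are already encoded as raw bytes.
proc encode_components {pairs} {
    set parts {}
    foreach {enc component} $pairs {
        # Convert this one component from Tcl's internal form to bytes.
        lappend parts [encoding convertto $enc $component]
    }
    return [file join {*}$parts]
}

# Illustrative only: pretend the outer mount uses ISO 8859-1 names
# and the inner one uses UTF-8 names.
set raw_path [encode_components [list utf-8 / iso8859-1 "\u00e9" utf-8 "\u00e9"]]
puts [binary encode hex $raw_path]
```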


If you're on Windows, please just let Tcl handle the details. Tcl uses the Wide (Unicode) Windows API directly so you can pretend that none of these problems exist. (There are other problems instead.)

On macOS, please leave encoding system alone as it is correct. Macs have a very opinionated approach to encodings.

Donal Fellows
  • Thank you very much for your detailed answer! The `encoding system` command returns `iso8859-1`. Does that mean the backend OS (CentOS 6) is not correctly configured, or that the operating OS (Windows 10) is? I already tried the `file copy` command but it says **error copying "/tmp/file7k5kqg" to "/path/to/dir/mémo.txt": no such file or directory**... – Simon Aug 28 '19 at 12:09
  • The command `file copy $tmp_filename [encoding convertto utf-8 $dest_path]` did work though! – Simon Aug 28 '19 at 12:20
  • There is a small typo in the last code listing: `iso98859-1` ==> `iso8859-1` – mrcalvin Aug 28 '19 at 22:23
  • In the end I could make it work by setting `encoding system utf-8` in the filestorage module. – Simon Aug 29 '19 at 08:17

I already tried the file copy command but it says error copying "/tmp/file7k5kqg" to "/path/to/dir/mémo.txt": no such file or directory

My reading of your problem is that, for some reason, your Tcl is set to iso8859-1 ([encoding system]), while the executing environment (shell) is set to utf-8. This explains why Donal's suggestion works for you:

encoding system iso8859-1
file copy $tmp_filename [encoding convertto utf-8 $dest_path]

This safely passes a UTF-8 encoded byte array down to any syscall: é (\u00e9) goes out as the two bytes \xc3\xa9. Watch:

% binary encode hex [encoding convertto utf-8 é] 
c3a9
% encoding system iso8859-1; exec xxd << [encoding convertto utf-8 é] 
00000000: c3a9                                     ..

This is equivalent to [encoding system] itself being set to utf-8 (as is to be expected in an otherwise utf-8 environment):

% encoding system
utf-8
% exec xxd << é
00000000: c3a9                                     ..

What you are experiencing (without any intervention) seems to be a re-coding of the Tcl internal encoding to iso8859-1 on the way out from Tcl (because of [encoding system], as Donal describes), and a follow-up (and faulty) re-coding of this iso8859-1 value into the utf-8 environment.

Watch the difference (\xe9 vs. \xc3\xa9):

% encoding system iso8859-1
% encoding system
iso8859-1
%  exec xxd << é
00000000: e9
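The same difference can be shown without an external program, as a small sketch using only encoding convertto and binary encode hex (Tcl 8.6 or later). It also ties back to the original cp error: the \351 shown there is octal for 0xe9, the ISO 8859-1 byte for é.

```tcl
# Use an escape for é so the result does not depend on the script file's encoding.
set name "m\u00e9mo.txt"

# Bytes Tcl sends out when the system encoding is iso8859-1: é is the single byte e9.
puts [binary encode hex [encoding convertto iso8859-1 $name]]

# Bytes a UTF-8 filesystem expects: é is the two bytes c3 a9.
puts [binary encode hex [encoding convertto utf-8 $name]]
```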

The problem, it seems, is that the lone byte \xe9 then has to be interpreted in your otherwise utf-8 environment, where it is not a valid sequence:

$ locale
LANG="de_AT.UTF-8"
...
$ echo -ne '\xe9'
?
$ touch `echo -ne 'm\xe9mo.txt'`
touch: m?mo.txt: Illegal byte sequence
$ touch mémo.txt
$ ls mémo.txt 
mémo.txt
$ cp `echo -ne 'm\xe9mo.txt'` b.txt
cp: m?mo.txt: No such file or directory

But:

$ cp `echo -ne 'm\xc3\xa9mo.txt'` b.txt
$ ls b.txt
b.txt

Your options:

(1) You need to find out why Tcl picks up iso8859-1, to begin with. How did you obtain your installation? Self-compiled? What are the details (version)?

(2) You may proceed as Donal suggests, or alternatively, set encoding system utf-8 explicitly:

encoding system utf-8
file copy $tmp_filename $dest_path
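If you would rather not change the setting for good, one possible approach is to switch it only around the copy and restore it afterwards. This is a sketch, not something from your application: with_system_encoding is a hypothetical helper name, and try/finally needs Tcl 8.6. As far as I know the system encoding is shared by the whole process, so in a multi-threaded server other threads may see the temporary value.

```tcl
# Hypothetical helper: run a script with a temporary system encoding,
# restoring the previous one even if the script raises an error.
proc with_system_encoding {enc script} {
    set saved [encoding system]
    encoding system $enc
    try {
        uplevel 1 $script
    } finally {
        encoding system $saved
    }
}

# Usage (tmp_filename and dest_path as in the question):
# with_system_encoding utf-8 { file copy $tmp_filename $dest_path }
```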
mrcalvin
  • Thank you for the explanation! I could make it work by setting `encoding system utf-8` as you suggested. – Simon Aug 29 '19 at 08:16
  • Glad it helped, but beware, this is only a workaround, not a permanent fix. You need to find out why your Tcl becomes initialised into a `iso8859-1` rather than `utf-8` mode. – mrcalvin Aug 29 '19 at 19:51
  • Yes I will continue to investigate. I did not install it, it was included in this application that we are using: http://www.project-open.com/ – Simon Aug 30 '19 at 07:19
  • This is based on https://openacs.org, you might want to reach out to their [forums](https://openacs.org/forums/forum-view?forum_id=14014). Besides, try to find out whether in your `config.tcl`, there is some parameter `systemencoding` set to `iso8859-1`. – mrcalvin Aug 30 '19 at 19:15