
This is driving me crazy. Consider the following bash script:

testdir="./test.$$"
echo "Creating a testing directory: $testdir"
mkdir "$testdir"
cd "$testdir" || exit 1

echo "Creating a file word.txt with content á.txt"
echo 'á.txt' > word.txt

fname=$(cat word.txt)
echo "The word.txt contains:$fname"

echo "creating a file $fname with a touch"
touch "$fname"
ls -l

echo "command: bash cycle"
while read -r line
do
    [[ -e "$line" ]] && echo "$line is a file"
done < word.txt

echo "command: find . -name $fname -print"
find . -name "$fname" -print

echo "command: find . -type f -print | grep $fname"
find . -type f -print | grep "$fname"

echo "command: find . -type f -print | fgrep -f word.txt"
find . -type f -print | fgrep -f word.txt

On FreeBSD (and probably on Linux too) the script gives this result:

Creating a testing directory: ./test.64511
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 1
-rw-r--r--  1 clt  clt  7  3 júl 12:51 word.txt
-rw-r--r--  1 clt  clt  0  3 júl 12:51 á.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
./á.txt
command: find . -type f -print | grep á.txt
./á.txt
command: find . -type f -print | fgrep -f word.txt
./á.txt

Even on Windows 7 (with Cygwin installed), running the script gives the correct result.

But when I run this script in OS X bash, I get this:

Creating a testing directory: ./test.32534
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 8
-rw-r--r--  1 clt  staff  0  3 júl 13:01 á.txt
-rw-r--r--  1 clt  staff  7  3 júl 13:01 word.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
command: find . -type f -print | grep á.txt
command: find . -type f -print | fgrep -f word.txt

So only bash found the file á.txt; neither find nor grep did. :(

I asked first on apple.stackexchange, and one answer suggested using iconv to convert the filenames:

$ find . -name $(iconv -f utf-8 -t utf-8-mac <<< á.txt)

While this works on OS X, it is terrible anyway (it requires running an extra command for every UTF-8 string entered at the terminal).

I'm trying to find a general, cross-platform bash programming solution. So the questions are:

  • Why does bash on OS X "find" the file while find doesn't?

and

  • How can I write a cross-platform bash script when Unicode filenames are stored in a file?
  • Is the only solution to write special versions for OS X with iconv?
  • Does a portable solution exist in other scripting languages, such as Perl?

PS: Finally, not really a programming question, but I wonder what the rationale is behind Apple's decision to use decomposed filenames, which don't play nicely with command-line UTF-8.

EDIT

A simple od dump:

$ ls | od -bc
0000000   141 314 201 056 164 170 164 012 167 157 162 144 056 164 170 164
           a   ́    **   .   t   x   t  \n   w   o   r   d   .   t   x   t
0000020   012                                                            
          \n   

and

$ od -bc word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

so reading it back gives the same bytes:

$ while read -r line; do echo "$line" | od -bc; done < word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

and the output from find is the same as from ls:

$ find . -print | od -bc
0000000   056 012 056 057 167 157 162 144 056 164 170 164 012 056 057 141
           .  \n   .   /   w   o   r   d   .   t   x   t  \n   .   /   a
0000020   314 201 056 164 170 164 012                                    
           ́    **   .   t   x   t  \n      

So the content of word.txt IS DIFFERENT from the name of the file that was created from it. Therefore, I still have no explanation of why bash found the file.

clt60

2 Answers


Unicode is hard. Repeat it every time you brush your teeth.

Your á.txt filename contains 5 characters, of which á is the troublesome one. There is more than one way to represent á as a sequence of Unicode code points. There's the precomposed representation, and the decomposed one. Unfortunately most software is not prepared to deal with characters, settling for code points instead (yes most software is cr*p). This means that given precomposed and decomposed representations of the same character, software will not recognize them as the same.

You have a precomposed á, represented as Unicode code point U+00E1 LATIN SMALL LETTER A WITH ACUTE. Windows uses the precomposed representation. Mac filesystems insist on the decomposed representation (well, mostly; utf-8-mac does not decompose certain character ranges, but á is decomposed OK). So on a mac your á becomes U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (writing off the top of my head, not having a Mac handy). Linux filesystems accept whatever you throw at them.
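To make the two representations concrete, here is a small demonstration (a sketch using POSIX printf octal escapes; both strings render as á.txt but are different byte sequences):

```shell
# Precomposed á: one code point, U+00E1, encoded in UTF-8 as bytes c3 a1
pre=$(printf '\303\241.txt')
# Decomposed á: U+0061 'a' followed by U+0301 combining acute, bytes 61 cc 81
dec=$(printf 'a\314\201.txt')

# Both display as "á.txt", but as byte strings they are not equal
[ "$pre" = "$dec" ] && echo "same" || echo "different"   # prints "different"
printf '%s' "$pre" | od -An -tx1   #  c3 a1 2e 74 78 74
printf '%s' "$dec" | od -An -tx1   #  61 cc 81 2e 74 78 74
```

This byte-level difference is exactly what the od dumps in the question's EDIT show: word.txt holds the two-byte precomposed form, while ls and find report the three-byte decomposed form.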

If you give find a precomposed á, it will not find a file with a decomposed á in its name, because it's not prepared to deal with this brouhaha.

So what's the solution? There isn't any. If you want to handle Unicode, you have to work around defects of the common tools.

Here's one slightly less ugly workaround. Write a small bash function (using iconv or whatever) that converts its argument to a representation acceptable on the current system, and use it throughout. Let's call it u8:

find . -name "$(u8 "$myfilename")" -print
find . -type f -print | fgrep "$(u8 "$myfilename")"

and so on. Pretty it's not, but it should work.
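A minimal sketch of such a u8 function, assuming iconv's utf-8-mac converter is available on OS X (it normally is) and that passing the string through unchanged is the right behavior everywhere else:

```shell
# Hypothetical u8 helper: convert a string to the representation the
# local filesystem stores. HFS+ on OS X stores the decomposed form,
# which iconv calls utf-8-mac; on other systems, pass it through.
u8() {
    case "$(uname -s)" in
        Darwin) printf '%s\n' "$1" | iconv -f utf-8 -t utf-8-mac ;;
        *)      printf '%s\n' "$1" ;;
    esac
}
```

With this in place, commands like find . -name "$(u8 "$myfilename")" -print behave the same on Linux, FreeBSD and OS X.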

Oh and I think we all should start sending bug reports for this cr*p. Our software should eventually strive to understand basic human concepts like characters (I'm not even starting to talk about strings). Code points just don't cut it, sorry, even if they're Unicode code points.

n. m. could be an AI
  • So, the `bash` (compiled for OS X) correctly transforming internally the precomposed `á` when looking for filenames into decomposed `á` and other utilities, like find, grep etc.. doesn't. So the way write "portable" scripts is using only "pure bash" when possible... right? – clt60 Jul 03 '13 at 13:14
  • *So, the bash (compiled for OS X) correctly transforming internally the precomposed á* Well why not? I haven't looked at the source but it's the right thing to do. – n. m. could be an AI Jul 03 '13 at 13:22
  • ok, accepting the answer - mainly for: "unicode is hard" and "the software is crap" :) :) - with my addition: most of the software is at a level from the 19th century and knows nothing about users' needs. Especially software companies, Apple too :(. Linux is (mostly) ok. Sigh. Thank you :) – clt60 Jul 03 '13 at 13:28
  • I'm pretty sure bash's `-e` test works because it's handed off to the filesystem code (which does this stuff right) instead of trying to do it in bash. find, on the other hand, does its own comparisons and doesn't know what encoding the filesystem uses, and hence doesn't really know what the right thing to do is; it can't, because the "right thing" depends on the filesystem! On linux, "á.txt" (precomposed version) and "á.txt" (decomposed version) are *different filenames*, while on OS X they're different ways of writing the same filename. – Gordon Davisson Jul 03 '13 at 13:51
    @GordonDavisson: I can rant all day about how Linux treatment of filenames as NUL-terminated byte strings is a huge bug. But an application program can actually ignore the fact and treat everything as e.g. UTF8, *and treat canonically equal strings as equal*. So you can end up with two files with "equal" (as far as the application is concerned) names at the same time in the same directory. This is not a huge problem, the app just needs to be prepared to handle the situation. It doesn't matter if the app is using precomposed or decomposed strings, it just should use one kind consistently. – n. m. could be an AI Jul 03 '13 at 15:54
  • @n.m. Yup. And while we're at it, I'd love a version of grep where `grep --anyencoding "á.txt"` would match UTF-8 precomposed OR UTF-8 decomposed OR UTF-16LE precomposed OR... – Gordon Davisson Jul 03 '13 at 16:32
  • Sounds like someone needs to submit a bug report for `find`'s handling of Unicode (I personally have no idea how), and Linux's lack of canonicalisation for Unicode strings. I totally agree; treating URLs as NUL-terminated byte strings is *criminal*. Even if it's currently by design, that is just a series of nasty, deeply invisible bugs waiting to happen. – iono Dec 29 '20 at 10:54
  • @iono Linux filesystems have no idea about Unicode. Filenames are *byte* strings by definition. I'm not quite sure it's the job of specifically the `find` utility to compensate for this. – n. m. could be an AI Dec 29 '20 at 12:31

Creating the file with touch and testing its existence with [[ -e "$line" ]] uses the same encoding, so the file gets found.

Testing its existence with find -name and listing it with find -print seem to use a different encoding. I propose piping the output of find -print into a hex dumper (xxd, od -x or similar). This will probably show you which encoding find uses for -print (and probably also for -name).

The general solution for encoding problems is always: USE JUST ONE ENCODING. In your case you have to decide which point is easier to adapt: you can change the encoding at the creation of the file (touch "$(iconv -f utf-8 -t utf-8-mac <<< á.txt)" or similar), or change what you give to find (the solution already given in your question). Since bash itself seems to cope well with the Unicode filenames and only find seems to have this problem, I would propose doing the necessary conversion there. Maybe there is even a configuration option for the Mac OS version of find which states which encoding it shall use for -name and -print.
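For example, the word list itself could be normalized once per run instead of once per lookup. The sketch below assumes word.local.txt is a scratch name of my choosing, and it falls back to a plain copy on systems whose iconv lacks the utf-8-mac converter (GNU iconv on Linux typically does):

```shell
# Demo setup, mirroring the question's script: a word list plus a file
# whose name is taken from it (precomposed á written with octal escapes)
tmp=$(mktemp -d) && cd "$tmp" || exit 1
fname=$(printf '\303\241.txt')
printf '%s\n' "$fname" > word.txt
touch "$fname"

# Normalize the word list once for the local filesystem, then reuse it
if iconv -l 2>/dev/null | grep -qi 'utf-8-mac'; then
    # OS X: convert the list to the decomposed form HFS+ actually stores
    iconv -f utf-8 -t utf-8-mac word.txt > word.local.txt
else
    # No utf-8-mac converter here; filenames already match byte-for-byte
    cp word.txt word.local.txt
fi
find . -type f -print | fgrep -f word.local.txt
```

The same normalized list can then feed every subsequent find, grep or fgrep invocation in the script.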

Alfe
  • Unfortunately no. See the edit. I still have no explanation of why `bash` found the file. And still no solution for HOW TO WRITE a SCRIPT that will run correctly on ANY (major) platform: Linux, FreeBSD, OS X (and maybe Windows/Cygwin) too... – clt60 Jul 03 '13 at 12:53