What's the best way to embed a Unicode character in a POSIX shell script?

Question

There are several shell-specific ways to include a ‘unicode literal’ in a string. For instance, in Bash, the quoted string-expanding mechanism, $'', allows us to directly embed a Unicode character: $'\u2620'.

However, if you're trying to write universally cross-platform shell scripts (in practice, this usually boils down to “runs in Bash, Zsh, and Dash”), that's not a portable feature.

I can portably achieve anything in the ASCII table (octal number-space) with a construct like the following:

WHAT_A_CHARACTER="$(printf '\036')"

… however, POSIX / Dash printf only supports octal escapes.
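As a minimal sketch of the octal-only route, the helper below turns a hexadecimal byte value into the octal escape that a POSIX printf understands. The name hex_byte is hypothetical (it is not part of any shell), and the helper only makes sense for single-byte values from 0x00 to 0xff:

```shell
# hex_byte (hypothetical helper): emit the single byte whose value is given
# as a C-style integer constant, e.g. 0x1e or 30. Only meant for 0..255.
hex_byte() {
    # The inner printf renders the value as three octal digits (0x1e -> 036);
    # the outer printf then interprets the resulting \036 escape.
    printf "\\$(printf '%03o' "$1")"
}
```

With this, WHAT_A_CHARACTER="$(hex_byte 0x1e)" is equivalent to the printf '\036' form, but keeps the hexadecimal value visible in the source.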

I can also obviously achieve the full Unicode space by farming the task out to a fuller programming environment:

OH_CAPTAIN_MY_CAPTAIN="$(ruby -e 'print "\u2388"')"
TAKE_ME_OUT_TONIGHT="$(node -e 'console.log("\u266C")')"

So: what's the best way to encode such a character into a shell-script, that:

1. Works in dash, bash, and zsh,
2. shows the hexadecimal encoding of the codepoint in the code,
3. isn't dependent on the particular encoding of the string (i.e. not by encoding UTF-8 bytes in octal),
4. and finally, doesn't require the invocation of any “heavy” interpreter (let's say, less than 0.01s runtime).

Without 2, you can of course have your character verbatim in the source of the script, e.g., `printf '⎈♬\n'`. If you have a decent editor, putting the cursor on it should show the code; and you should be able to enter it too (e.g., Ctrl+Shift+u 2388). I don't see why 2 is really an issue. — gniourf_gniourf, Dec 26 '14 at 08:50
@gniourf_gniourf the issue with 2 is basically that it *requires* a decent editor. There's plenty of situations where I want my source-code to be accessible to those without that luxury. Having special characters that are crucial to the function of the program encoded in an accessible way opens up development of the source to a larger contributor-group. This isn't *always* (or even often!) a concern, but sometimes, it's worth considering. ;) — ELLIOTTCABLE, Jan 03 '15 at 22:59
@gniourf_gniourf (there's plenty of situations, even in freaking 2015, where verbatim Unicode-encoded documents will get mangled by pipelines assuming simple ASCII or ISO-8859-1. It's a sad truth.) — ELLIOTTCABLE, Jan 03 '15 at 23:00
This is a great question, and I agree with your point about including bare glyphs. But: - #2 is somewhat self-redundant: this is your specified _input format_, so it will only show a hex code if you put one there. A codepoint is just a 32-bit integer associated with a glyph, and the hexadecimal U+09AF format is just a convention. https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology - Number three is nonsensical. You have to have the string encoded _somehow_; as you already mention, POSIX `printf` only allows for octal escapes. So with a locale UTF-8, that's simply the only way. — Geoff Nixon, Jun 26 '15 at 17:44
@CharlesDuffy for my personal application of these issues, it's because I have random selections from large swathes of the Unicode tables memorized for old-school Alt-code input on Windows, back in the day. :P — ELLIOTTCABLE, Jul 20 '15 at 15:40

rici · Accepted Answer · 2015-06-26T17:47:44.910

13

If you have GNU printf installed (it's in the Debian package coreutils, for example), then you can use it independently of which shell you are using by avoiding the shell's builtin:

env printf '\u2388\n'

Here I am using the Posix-standard env command to avoid the use of the printf builtin, but if you happen to know where printf is, you could do this directly by using the complete path, such as

/usr/bin/printf '\u2388\n'
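Whether the external printf understands \u can be probed at run time. The following is only a sketch: has_unicode_printf is a hypothetical name, and the check assumes the script runs in a UTF-8 locale, since the expected bytes are the UTF-8 encoding of U+2388 written in portable octal:

```shell
# has_unicode_printf (hypothetical name): succeed only when the external
# printf expands \u2388 to the bytes E2 8E 88 (its UTF-8 encoding).
has_unicode_printf() {
    expected=$(printf '\342\216\210')    # portable octal spelling of E2 8E 88
    # env bypasses the shell builtin; a non-zero exit status counts as "no".
    actual=$(env printf '\u2388' 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}
```

A script could call has_unicode_printf once and fall back to a slower method when it fails.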

If both your external printf and your shell's builtin printf only implement the Posix standard, you need to work harder. One possibility is to use iconv to translate to UTF-8, but while the Posix standard requires that there be an iconv command, it does not in any way prescribe the way standard encodings are named. I think the following will work on most Posix-compatible platforms, but the number of subshells created might be sufficient to make it less efficient than a "heavy" script interpreter:

printf $(printf '\\%o' $(printf %08x 0x2388 | sed 's/../0x& /g')) |
iconv -f UTF-32BE -t UTF-8

The above uses the printf builtin to force the hexadecimal codepoint value to be 8 hex digits long, then sed to rewrite them as 4 hex constants, then printf again to change the hex constants into octal notation and finally another printf to interpret the octal character constants into a four-byte sequence which can be fed into iconv as big-endian UTF-32. (It would be simpler with a printf which recognizes \x escape codes, but Posix doesn't require that and dash doesn't implement it.)
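For reuse, the pipeline can be wrapped in a function. This is a sketch rather than a tested library routine: the name unichr is hypothetical, and it assumes the local iconv accepts the encoding names UTF-32BE and UTF-8. The only change from the one-liner is that the format arguments are quoted:

```shell
# unichr (hypothetical name): print the UTF-8 encoding of each code point
# given as an integer constant, e.g. unichr 0x2388 0x266c
unichr() {
    # 1. '%08x' pads every code point to 8 hex digits (4 bytes).
    # 2. sed splits the digits into byte-sized constants: 0x00 0x00 0x23 0x88
    # 3. '\\%o' rewrites each constant as an octal escape: \0\0\43\210
    # 4. The outer printf turns the escapes into raw big-endian UTF-32 bytes.
    printf "$(printf '\\%o' $(printf '%08x' "$@" | sed 's/../0x& /g'))" |
        iconv -f UTF-32BE -t UTF-8
}
```

Called as unichr 0x2388, it should print the same character as the pipeline above.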

You can use the line without modification to print more than one symbol, as long as you provide the Unicode codepoints (as integer constants) for all of them (example executed in dash):

$ printf $(printf '\\%o' $(printf %08x 0x2388 0x266c 0xA |
>                          sed 's/../0x& /g')) |
> iconv -f UTF-32BE -t UTF-8
⎈♬
$

Note: As Geoff Nixon mentions in a comment, the fish shell (which is nowhere close to Posix standard, and as far as I can see has no aspirations to conform) will complain about the unquoted %08x format argument to printf, because it expects words starting with % to be jobspecs. So if you use fish, add quotes to the format argument.

edited Jun 26 '15 at 17:47

answered Dec 26 '14 at 15:37

rici


2

The [`command`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/command.html) command is designed for the purpose for which you're using `env`. – Jonathan Leffler Dec 26 '14 at 18:19
@JonathanLeffler: `command` avoids shell functions, not shell built-ins. (Posix: "If the command_name is the same as the name of one of the special built-in utilities, the special properties in the enumerated list at the beginning of Special Built-In Utilities shall not occur."; `man dash`: "Execute the specified command but ignore shell functions when searching for it.") – rici Dec 26 '14 at 19:21
1

OK — sort of. I see what you're driving at. Strictly, under the terms of reference for POSIX, `printf` is not a shell special built-in: the commands that are special built-ins are: `break`, `colon`, `continue`, `dot`, `eval`, `exec`, `exit`, `export`, `readonly`, `return`, `set`, `shift`, `times`, `trap`, `unset` (where `colon` and `dot` are spelled `:` and `.` respectively). What a shell should do with `printf` is 'not execute the built-in' — whether it actually executes something else is debatable. As an example, in Bash, `command cd` still manages to change the shell's directory. – Jonathan Leffler Dec 26 '14 at 19:30
1

@JonathanLeffler: Perhaps whether it *should* execute something else is debatable (although Posix's "In every other respect, if command_name is not the name of a function, the effect of command (with no options) shall be the same as omitting command" doesn't leave much wiggle room), but it is not debatable whether it *does* execute something else. For example, in `dash` on my ubuntu install, `env printf '\u2388\n'` and `command printf '\u2388\n'` in `dash` do not print the same thing, and I'm sure you can easily reproduce the experiment. – rici Dec 26 '14 at 19:34
This is a very well-thought-out answer. I've dropped an upvote on it, but I'm not going to accept it just yet. (Forgive me, the ‘heavyness’ makes me want to dream that somebody can come up with some sort of slick, cross-platform hack that's a bit more friendly.) … That said, this is probably what I'll end up using myself; and I'll come back and accept the answer later if nothing else pops up. <3 – ELLIOTTCABLE Jan 03 '15 at 22:58
This is a **really** good answer — @ELLIOTTCABLE, you really gotta `/usr/bin/printf '\342\234\205'` this answer. I probably just spent a good 10 hours trying to pin down these escapes/format strings until I found this. @rici, two things: 1. Although the question does mention a POSIX shell, you still should probably always single-quote the format string. Some shells (bash, fish) still have csh-style job control identifiers, so you might end up with `%08x: no such job` or something. 2. With the `iconv` I have (from Apple's GNU libiconv), it has to be UTF-32BE, not UTF32BE. – Geoff Nixon Jun 26 '15 at 17:19
And for what it's worth, I also am pretty uncomfortable with `env printf`. Better to exec in an explicit subshell, either bare `(exec printf $(exec printf ...` or with function: `uniprintf()( exec printf $(exec printf '\\%o' $(printf '%08x' "$*" | ...` (note parens instead of brackets in the definition). – Geoff Nixon Jun 26 '15 at 17:23
@GeoffNixon: Thanks for the note about Apple libiconv; I did mention in the answer that names are not standardized, but I'll add the dash since it also works on more recent GNU iconv. Bash doesn't complain about % because it is not an expansion; it's only valid in a command which expects a jobspec. fish does complain, but fish is nowhere close to a Posix shell (and as far as I can see, doesn't have any desire to be). Why does env make you uncomfortable? It's Posix standard. – rici Jun 26 '15 at 17:39
1

@GeoffNixon: Also, in the long version `printf $(printf ...)`, there is no need to prevent the shell from using its builtin printf, and it is much less inefficient to let it do so. The only reason to use `env` in the first suggestion is to deal with shells like `dash` whose builtin printf's don't handle the unicode escape. – rici Jun 26 '15 at 17:49
@ELLIOTTCABLE I suppose you're right... I guess I've just been a bit wary of env since http://unix.stackexchange.com/questions/157329/what-does-env-x-command-bash-do-and-why-is-it-insecure?answertab=votes#tab-top Not the vulnerability necessarily just weird stuff like `export -f`. – Geoff Nixon Jun 26 '15 at 18:01
1

@GeoffNixon: That vulnerability has nothing to do with `env`. It is about bugs in `bash` as it reads the environment on startup. True, you can use `env` to plant a non-compliant string into the environment, but in this invocation it is obvious that that won't happen, and in any case there is no child `bash` process which could misinterpret the string. – rici Jun 26 '15 at 18:11

score -3 · Answer 2 · answered Dec 26 '14 at 07:12

I would go with

echo -e "\xc3\xb6"

To check it:

~ $ echo -e "\xc3\xb6"
ö
~ $ echo -n ö | hexdump
0000000 b6c3                                   
0000002

wgitscht


1

Note that, e.g., `\u2388 ≠ \x23\x88`. Rather, `\u2388 = \xe2\x8e\x88\x00`. – gniourf_gniourf Dec 26 '14 at 09:55
2

This doesn't work in `dash`. Also, it probably fails requirement (3): "isn't dependant on the particular encoding of the string (i.e. not by encoding UTF-8 bytes in octal)". (It requires encoding UTF-8 bytes, albeit in hex, so it obscures the original Unicode codepoint.) – rici Dec 26 '14 at 15:05
1

Also, honoring `-e` as anything other than a string to be printed violates the POSIX specification for `echo` (not "isn't guaranteed by", but actually "violates"). `-n` and strings with backslashes are implementation-defined barring XSI extensions, whereas `-e` isn't allowed at all. See http://pubs.opengroup.org/onlinepubs/009604599/utilities/echo.html -- including the APPLICATION USAGE section of the standard, which suggests using `printf` instead. – Charles Duffy Jun 26 '15 at 17:47
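For comparison, the same two bytes can be written with printf and octal escapes, which avoids `echo -e` entirely. This is only a sketch, with o_umlaut as a hypothetical name, and it still hard-codes the UTF-8 bytes, so it does not address requirement (3):

```shell
# o_umlaut (hypothetical name): U+00F6 (ö) as its two UTF-8 bytes,
# written in octal: 0xC3 = 0303, 0xB6 = 0266.
o_umlaut() {
    printf '\303\266'
}
```

Piping it through od, as in `o_umlaut | od -An -tx1`, lists the bytes in order (c3 b6). The `b6c3` in the hexdump output above is the same pair, shown as one 16-bit word on a little-endian machine.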
