tl;dr
- Use pgrep instead of ps + grep.
- Use iconv -t UTF8-MAC to convert your search string to NFD (normalized decomposed Unicode) form.
pgrep -qlf "$(iconv -t UTF8-MAC <<<'amétiq siMed Büro.app')" && echo "RUNNING"
In a nutshell: the Mac filesystem (HFS+) stores filenames in decomposed Unicode form (NFD), whereas what you type into a shell is in composed Unicode form (NFC), and neither the shell nor the Unix utilities treat two equivalent strings - same content, different forms - as content-identical, even though they should.
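A minimal demonstration of the mismatch (assuming bash and a UTF-8 terminal; 'Büro' is just a sample string):
# 'Büro' in NFC (ü as a single code point) vs. NFD (u + combining diaeresis);
# they render identically, but the shell treats them as different strings.
nfc=$'B\xc3\xbcro'
nfd=$'Bu\xcc\x88ro'
[[ $nfc == "$nfd" ]] && echo "equal" || echo "NOT equal"   # -> NOT equal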
If the gory details interest you, read on.
Background
Some accented Unicode characters have a composed form - a single code point representing the character directly (e.g., ü) - as well as an equivalent decomposed form - the base character followed by a combining diacritical character (e.g., u, followed by ¨); see https://en.wikipedia.org/wiki/Unicode_equivalence for more information.
Strings that contain only composed characters are in the NFC normal[ized] form (C for 'Composed'), whereas strings that only contain decomposed ones are in the NFD normal[ized] form (D for 'Decomposed').
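As a quick sketch (both forms render identically in a UTF-8 Terminal):
# 'ü' in its two equivalent Unicode forms:
printf '\xc3\xbc\n'    # NFC: single code point U+00FC, UTF-8 bytes c3 bc
printf 'u\xcc\x88\n'   # NFD: 'u' followed by U+0308,   UTF-8 bytes 75 cc 88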
The Mac filesystem (HFS+) stores filenames in NFD (DEcomposed), which has the following implications:
The names of applications launched via Finder and Spotlight are represented as NFD strings in the system's process table.
Similarly, in a shell (bash in Terminal.app), all of the following techniques yield NFD strings (see the illustration after this list):
- pathname expansion (e.g., echo *.app)
- output from ls and similar utilities
- interactive filename completion at the prompt
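An illustration of the first point, using a hypothetical filename 'Büro.txt' in a throwaway test directory; the file is created under an explicitly NFD name (as HFS+ would store it), and the glob-expanded result then fails to compare equal to the NFC form you would type:
mkdir -p /tmp/nfd-demo && cd /tmp/nfd-demo
touch "$(iconv -t UTF8-MAC <<<'Büro.txt')"       # create the file under its NFD name
[[ "$(printf '%s' B*ro.txt)" == 'Büro.txt' ]] || echo "glob result (NFD) != typed name (NFC)"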
By contrast, if you type a script or application name in a shell (or copy an NFC string from elsewhere), it will be represented in NFC.
The crux of the problem: the shell and the Unix utilities do not recognize the equivalence of NFD and NFC forms and therefore treat them as different.
The - cumbersome and obscure - workaround is to only match NFD strings against NFD strings, and only NFC strings against NFC strings.
The insidious thing is that NFD and NFC forms of a given string look absolutely identical in the shell - as they should - but are treated differently.
To determine whether a given string is in NFD or NFC form, use, e.g.:
cat -v <<<'amétiq siMed Büro.app'
- If the string is in NFC, the output is the same as the input.
- If the string is in NFD, the output contains garbled characters; e.g., ame?M-^Atiq siMed Bu?M-^Hro.app (this, in fact, is what ps reports - though it shouldn't).
Alternatively, pipe to hexdump -C to see the individual byte values.
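For instance (a sketch; which bytes you see depends on where the string came from - typed input is NFC, whereas output pasted from ps or pathname expansion is NFD):
hexdump -C <<<'amétiq siMed Büro.app'   # NFC 'é' appears as c3 a9; in NFD it would be 65 cc 81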
Note that the man page's remark about ps not correctly displaying argument lists containing multibyte characters is not true per se (at least as of OS X 10.9.2): NFC strings are printed correctly, whereas NFD ones are not.
Contrast that with pgrep, which prints both NFC and NFD strings correctly, but doesn't recognize their equivalence when matching, as described.
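Hence the tl;dr recipe: the search string has to be converted to NFD before pgrep can find the NFD name in the process table (sketch, using the hypothetical app name from above; assumes the app is running and was launched via Finder):
pgrep -qlf 'amétiq siMed Büro.app' || echo "no match - NFC needle vs. NFD process-table entry"
pgrep -qlf "$(iconv -t UTF8-MAC <<<'amétiq siMed Büro.app')" && echo "RUNNING"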
Converting between NFC and NFD forms
- To generically convert any string between NFD and NFC, use iconv with the UTF8-MAC encoding scheme.
The following examples demonstrate the conversion using the input string 'ü':
- in NFC form, $'\xc3\xbc' - i.e., bytes 0xC3 0xBC, which are the UTF-8 encoding of Unicode code point 0xFC
- in NFD form, $'u\xcc\x88' - i.e., a u - the base character - followed by bytes 0xCC 0x88, which are the UTF-8 encoding of Unicode code point 0x308, the so-called combining diaeresis (¨)
Note that in Terminal the result will always appear as ü - pipe to hexdump -C, for instance, to see the byte values.
# NFC -> NFD
iconv -t UTF8-MAC <<<$'\xc3\xbc' # -> $'u\xcc\x88'
# NFD -> NFC
iconv -f UTF8-MAC <<<$'u\xcc\x88' # -> $'\xc3\xbc'
These conversions are safe to use in that an input string that is already in the target form is left as is.
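A quick check of that claim (sketch):
# Converting an already-NFD string to NFD leaves it unchanged; likewise for NFC.
[[ "$(iconv -t UTF8-MAC <<<$'u\xcc\x88')" == $'u\xcc\x88' ]] && echo "NFD input unchanged"
[[ "$(iconv -f UTF8-MAC <<<$'\xc3\xbc')" == $'\xc3\xbc' ]] && echo "NFC input unchanged"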
- To get a reusable ANSI-C-quoted form of a string - whether NFC or NFD - you can use the bash shell function quoteNonAscii listed below; in the case at hand, to get a representation of the application name in NFD form:
- cd to /Applications (or wherever your application lives)
- Run quoteNonAscii am*tiq*siMed*B*ro.app - pathname expansion will ensure that the glob expands to the NFD form of the filename.
# Pass any string to this function to output
# an ANSI-C-quoted string with all non-ASCII bytes represented
# as \x{nn} hex. codes; trailing newlines are always trimmed.
# Examples:
# quoteNonAscii 'ü' # (if NFC) -> $'\xc3\xbc'
# quoteNonAscii 'ü' # (if NFD) -> $'u\xcc\x88'
quoteNonAscii() {
hexdump -ve '/1 "%02x "' <<<"$*" |
awk -v RS=' ' '
BEGIN { printf "$\x27" } # print the opening of the ANSI-C-quoted string, `${single quote}`
$1=="0a" { nls=nls "\x5cn"; next } # store consecutive newlines in a temp. variable
nls { printf "%s", nls; nls="" } # a non-newline char; we now know that the newlines stored so far are NOT trailing, so we print them and clear the temp. variable.
$1>"7f" { printf "\\x" $1; next } # a non-ASCII byte -> PRINT AS `\xnn`
$1=="22" { printf "\x5c\x22"; next } # a double-quote char. -> escape with `\`
$1=="27" { printf "\x5c\x27"; next } # a single-quote char. -> escape with `\`
$1=="07" { printf "\\a"; next } # bell char.
$1=="08" { printf "\\b"; next } # backspace
$1=="09" { printf "\\t"; next } # tab
$1=="0b" { printf "\\v"; next } # vertical tab
$1=="0c" { printf "\\f"; next } # ff
$1=="0d" { printf "\\r"; next } # CR
$1=="1b" { printf "\\e"; next } # escape
{ system("printf %b \"\\x" $1 "\"") } # a byte that is an ASCII char -> print as a CHAR.
END { print "\x27"}' # print the closing `{single quote}` of the ANSI-C-quoted string.
}
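For example (using the hypothetical application name from above; the output shown is what you would expect if the filename on disk is in NFD form):
cd /Applications
quoteNonAscii am*tiq*siMed*B*ro.app   # -> $'ame\xcc\x81tiq siMed Bu\xcc\x88ro.app'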
Locales in macOS:
Note: This is a revised remnant from the original answer, which hopefully still contains useful information.
- Running locale in an interactive shell tells you what locale is in effect, reflected in the following environment variables: LANG, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME. For instance, if the US English locale is in effect, you'd see:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
By default, Terminal.app and other terminal programs such as iTerm preconfigure the locale for shells to match the user's locale as specified via System Preferences > Language & Region (in Terminal.app you can turn this behavior off by unchecking Set locale environment variables on startup under Preferences... > Settings > {Your Profile} > Advanced).
The character encoding - reflected in the .{encoding} suffix of the locale ID, typically .UTF-8 - will match the encoding configured in the terminal program's settings (for Terminal.app, go to Preferences... > Settings > {Your Profile} > Advanced and change the Character encoding setting), if supported (use locale -a to see all supported language/region + encoding combinations).
Both Terminal and iTerm default to UTF-8, which is a sensible choice.
If your terminal program is configured to use an unsupported character encoding, the locale ID reported will have NO encoding suffix (e.g., just en_US) in Terminal and revert to the "C" locale altogether in iTerm - and things will likely NOT work properly (Terminal will still let you print non-ASCII characters from that encoding, but the utilities won't recognize them as characters, resulting in illegal byte sequence errors).
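A sketch of the kind of failure this produces, simulated here by feeding a byte that isn't valid UTF-8 (0xFC, 'ü' in Latin-1) to a utility running under a UTF-8 locale; the exact error text may vary:
LC_ALL=en_US.UTF-8 tr '[:upper:]' '[:lower:]' <<<$'B\xfcro'   # -> tr: Illegal byte sequence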
Similarly, if you configure an unsupported combination of primary language and geographic region in System Preferences (e.g., combining "German" (de) with "United States" (US), which results in the unsupported locale de_US), only LC_CTYPE will be matched to your terminal program's encoding, and the other LC_* categories will default to "C".
In case you need to set a locale manually, run:
export LANG={localeId}
or
export LC_ALL={localeId}
The difference is that export LANG=... provides a default for all LC_* categories while allowing you to selectively override them, whereas export LC_ALL=... overrides all LC_* categories.
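A quick way to see the difference (sketch; assumes the de_CH.UTF-8 and en_US.UTF-8 locales exist, which locale -a will confirm):
export LANG=de_CH.UTF-8 LC_TIME=en_US.UTF-8   # LC_TIME selectively overrides the LANG default
locale                                        # all categories derive from LANG, except LC_TIME
export LC_ALL=C                               # LC_ALL now overrides everything
locale                                        # all categories report "C"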
Supported locale IDs can be listed with locale -a; it's best to choose one that is UTF-8-based, e.g., de_CH.UTF-8.
The POSIX locale - essentially an ASCII-only, US-English locale - can be selected either via "POSIX" or "C".
- Caveat: ALL Unix utilities that come with macOS suffer from the problem described above: they do not recognize equivalent Unicode strings in NFC and NFD form as identical. Aside from this issue, many, but not all, Unix utilities are UTF-8 multi-byte-character-aware in principle.
- A notable exception as of macOS 10.14 - i.e., a utility that is not UTF-8-aware at all - is awk; in earlier macOS versions sort wasn't UTF-8-aware either (this changed when the previously used obsolete GNU implementation was replaced with a recent BSD implementation).
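A simple sketch of what the lack of UTF-8 awareness looks like in practice (behavior as observed with the stock BSD awk on macOS; GNU gawk under a UTF-8 locale would report 1 instead):
awk 'BEGIN { print length("ü") }'   # macOS awk counts bytes -> 2, not 1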