in a shell script i need to find out whether a specific application is still running. this would be a simple task if our application name did not contain umlauts or accents (äöüàéè...). how can i reliably "grep" for the process in question?

the shell script gets the application name as parameter, "amétiq siMed Büro.app" in this example. There are several customized copies running at the same time, they are named differently, and the script should check only a specific application (the one it gets via param) and ignore the others.

no hits at all when using grep for the specific app-name (param):

bash> ps ax | grep "amétiq siMed Büro.app"

bash>

too many hits:

bash> ps ax | grep "/[A]pplications/am" 
 4335   ??  S      5:19.01 /Applications/ame?M^Atiq siMed Bu?M^Hro.app/Contents/MacOS/siMed2
10188   ??  S      0:03.18 /Applications/ame?M^Atiq siMed SUPPORT.app/Contents/MacOS/siMed2

again no hits when trying to manually narrow grep:

bash> ps ax | grep "/[A]pplications/am" | grep "Büro"

bash>

it seems that grep stops working after the position of the first occurrence of an Umlaut character.

i also tried lsof - no success. any idea what to try next?

running OS X 10.7-10.9

svenson
  • What does `locale` output in your bash shell? – mklement0 Apr 22 '14 at 14:09
  • it outputs this by default: LANG="de_CH.UTF-8" LC_COLLATE="de_CH.UTF-8" LC_CTYPE="de_CH.UTF-8" LC_MESSAGES="de_CH.UTF-8" LC_MONETARY="de_CH.UTF-8" LC_NUMERIC="de_CH.UTF-8" LC_TIME="de_CH.UTF-8" LC_ALL= – svenson Apr 22 '14 at 14:15

3 Answers

tl;dr

  • Use pgrep instead of ps + grep
  • Use iconv -t UTF8-MAC to convert your search string to NFD (normalized decomposed Unicode) form.
pgrep -qlf "$(iconv -t UTF8-MAC <<<'amétiq siMed Büro.app')" && echo "RUNNING"

In a nutshell: the Mac filesystem (HFS+) stores filenames in decomposed Unicode form (NFD), whereas what you type into a shell is in composed Unicode form (NFC). Neither the shell nor the Unix utilities treat the two forms of the same string as identical, even though they should.
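
Since the script in the question receives the application name as a parameter, the one-liner above could be wrapped roughly as follows. This is a minimal sketch, not a tested drop-in: the function name is_app_running is made up for illustration, it assumes an iconv that supports the UTF8-MAC encoding (as the one shipped with macOS does), and it redirects pgrep's output instead of using -q. The /Contents/MacOS/ suffix is taken from the process paths shown in the question; it also keeps the script from matching its own command line. Note that pgrep treats its argument as a regular expression, so the . in .app matches any character.

```shell
# Hypothetical helper: succeeds (exit 0) if a process whose command line
# contains "<name>/Contents/MacOS/" is running.
# Assumes iconv supports UTF8-MAC (true for the iconv that ships with macOS).
is_app_running() {
  # Convert the name to NFD; input that is already NFD is left as is.
  app_name_nfd=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF8-MAC) || return 2
  # -f matches against the full command line; the pattern is a regex.
  pgrep -f -- "$app_name_nfd/Contents/MacOS/" >/dev/null
}

# Usage: pass the name exactly as the script received it (NFC or NFD).
if is_app_running "${1:-amétiq siMed Büro.app}"; then
  echo "RUNNING"
else
  echo "NOT RUNNING (or the name could not be converted)"
fi
```

An exit status of 2 means the conversion itself failed, which is what to expect on systems whose iconv does not know UTF8-MAC.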

If the gory details interest you, read on.


Background

Some accented Unicode characters have a composed form - a single code point representing the character directly (e.g. ü) - as well as an equivalent decomposed form - the base character followed by a combining diacritical character (e.g., u, followed by ¨); see https://en.wikipedia.org/wiki/Unicode_equivalence for more information.

Strings that contain only composed characters are in the NFC normal[ized] form (C for 'Composed'), whereas strings that only contain decomposed ones are in the NFD normal[ized] form (D for 'Decomposed').
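
To make the difference concrete, here is a small sketch that builds both forms of ü byte by byte and compares them. The variable and function names are illustrative only; printf with octal escapes is used so that the snippet does not depend on bash's $'...' quoting.

```shell
# Both strings render as "ü" but consist of different bytes.
nfc=$(printf '\303\274')     # U+00FC, composed: bytes 0xC3 0xBC
nfd=$(printf 'u\314\210')    # "u" + U+0308 combining diaeresis: 0x75 0xCC 0x88

# Count bytes rather than characters (tr strips the padding some wc versions add).
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

echo "NFC: $(byte_len "$nfc") bytes, NFD: $(byte_len "$nfd") bytes"

# A plain string comparison is byte-wise, so the two forms do not match.
if [ "$nfc" = "$nfd" ]; then
  echo "treated as equal"
else
  echo "treated as different"
fi
```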

The Mac filesystem (HFS+) stores filenames in NFD (DEcomposed), which has the following implications:

  • Applications launched via Finder and Spotlight are represented as NFD strings in the system's process table.

  • Similarly, in a shell (bash in Terminal.app), all of the following techniques yield NFD strings:

    • pathname expansion (e.g. echo *.app)
    • output from ls and similar utilities
    • interactive filename completion at the prompt
  • By contrast, if you type a script or application name in a shell (or copy an NFC form from elsewhere), it will be represented in NFC.

The crux of the problem: the shell and the Unix utilities do not recognize the equivalence of NFD and NFC forms and therefore treat them as different.

The - cumbersome and obscure - workaround is to only match NFD strings against NFD strings, and only NFC strings against NFC strings.

The insidious thing is that NFD and NFC forms of a given string look absolutely identical in the shell - as they should - but are treated differently.

To determine whether a given string is in NFD or NFC form, use, e.g.:

 cat -v <<<'amétiq siMed Büro.app'
  • If the string is in NFC, the output is the same as the input.
  • If the string is in NFD, the output contains garbled characters; e.g., ame?M-^Atiq siMed Bu?M-^Hro.app (this, in fact, is what ps reports - though it shouldn't).

Alternatively, pipe to hexdump -C to see the individual byte values.
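
If hexdump is not at hand, od can produce a similar byte listing, and that listing can be searched for combining marks. The following is only a heuristic sketch with made-up function names: it flags UTF-8 sequences from the Combining Diacritical Marks block (U+0300 to U+036F), which covers the accents and umlauts discussed here but not every possible decomposed character.

```shell
# Print the bytes of a string as space-separated lowercase hex pairs.
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' '
}

# Heuristic: succeed if the string contains a UTF-8 encoded code point in
# U+0300..U+036F (bytes CC 80..BF or CD 80..AF), which suggests NFD.
has_combining_marks() {
  hex_bytes "$1" | grep -Eq ' (cc [89ab][0-9a-f]|cd ([89][0-9a-f]|a[0-9a-f]))'
}

name_nfd=$(printf 'Bu\314\210ro')   # "Büro" with a decomposed ü
name_nfc=$(printf 'B\303\274ro')    # "Büro" with a composed ü

hex_bytes "$name_nfd"; echo
has_combining_marks "$name_nfd" && echo "NFD form: contains combining marks"
has_combining_marks "$name_nfc" || echo "NFC form: no combining marks found"
```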

Note that the remark in man ps that ps does not correctly display argument lists containing multibyte characters is not quite accurate (at least as of OS X 10.9.2): NFC strings are printed correctly, whereas NFD ones are not. Contrast that with pgrep, which prints both NFC and NFD strings correctly, but doesn't recognize their equivalence when matching, as described.


Converting between NFC and NFD forms

  • To generically convert any string between NFD and NFC, use iconv with the UTF8-MAC encoding scheme.

The following examples demonstrate the conversion using the input string 'ü':

  • in NFC form, $'\xc3\xbc' - i.e., bytes 0xC3 0xBC, which is the UTF-8 encoding of Unicode code point 0xFC
  • in NFD form, $'u\xcc\x88' - i.e., a u - the base character - followed by bytes 0xCC 0x88, which is the UTF-8 encoding of Unicode code point 0x308, the so-called combining diaeresis (¨).

Note that in Terminal the result will always appear as ü; pipe to hexdump -C, for instance, to see the byte values.

  # NFC -> NFD
iconv -t UTF8-MAC <<<$'\xc3\xbc' # -> $'u\xcc\x88'

  # NFD -> NFC
iconv -f UTF8-MAC <<<$'u\xcc\x88' # -> $'\xc3\xbc'

These conversions are safe to use in that if the input string is already in the target format, it is left as is.
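
Because the conversion leaves already-converted input untouched, one way to compare two names regardless of their form is to convert both to NFD first. A minimal sketch, assuming an iconv with UTF8-MAC support as on macOS; to_nfd and same_name are hypothetical helper names, not existing utilities.

```shell
# Convert a string to NFD; requires an iconv that knows UTF8-MAC (macOS).
to_nfd() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF8-MAC
}

# Succeed if both arguments are the same text once normalized to NFD.
# Returns 2 if the conversion itself is not possible.
same_name() {
  a=$(to_nfd "$1") || return 2
  b=$(to_nfd "$2") || return 2
  [ "$a" = "$b" ]
}

typed=$(printf 'B\303\274ro.app')     # NFC, as typed at the prompt
on_disk=$(printf 'Bu\314\210ro.app')  # NFD, as returned by pathname expansion

if same_name "$typed" "$on_disk"; then
  echo "same name"
else
  echo "different names (or UTF8-MAC is not supported by this iconv)"
fi
```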

  • To get a reusable ANSI-C-quoted form of a string - whether NFC or NFD - you can use the bash shell function quoteNonAscii listed below; in the case at hand, to get a representation of the application name in NFD form:
    • cd to /Applications (or wherever your application lives)
    • Run quoteNonAscii am*tiq*siMed*B*ro.app - pathname expansion will ensure that the glob expands to the NFD form of the filename.
# Pass any string to this function to output 
# an ANSI-C-quoted string with all non-ASCII bytes represented
# as \x{nn} hex. codes; trailing newlines are always trimmed.
# Examples:
#    quoteNonAscii 'ü'   # (if NFC) -> $'\xc3\xbc'
#    quoteNonAscii 'ü'  # (if NFD) -> $'u\xcc\x88'
quoteNonAscii() {
  hexdump -ve '/1 "%02x "' <<<"$*" | 
    awk -v RS=' '  '
      BEGIN { printf "$\x27" }                # print the opening of the ANSI-C-quoted string, `${single quote}`
      $1=="0a" { nls=nls "\x5cn"; next }      # store consecutive newlines in a temp. variable
      nls      { printf "%s", nls; nls="" }   # a non-newline char; we now know that the newlines stored so far are NOT trailing, so we print them and clear the temp. variable.
      $1>"7f"  { printf "\\x" $1; next }      # a non-ASCII byte -> PRINT AS `\xnn`
      $1=="22" { printf "\x5c\x22"; next }    # a double-quote char. -> escape with `\`
      $1=="27" { printf "\x5c\x27"; next }    # a single-quote char. -> escape with `\`
      $1=="07"  { printf "\\a"; next }        # bell char.
      $1=="08"  { printf "\\b"; next }        # backspace
      $1=="09"  { printf "\\t"; next }        # tab
      $1=="0b"  { printf "\\v"; next }        # vertical tab
      $1=="0c"  { printf "\\f"; next }        # ff
      $1=="0d"  { printf "\\r"; next }        # CR
      $1=="1b"  { printf "\\e"; next }        # escape
      { system("printf %b \"\\x" $1 "\"") }   # a byte that is an ASCII char -> print as a CHAR.
      END { print "\x27"}'                    # print the closing `{single quote}` of the ANSI-C-quoted string.  
}

Locales in macOS:

Note: This is a revised remnant from the original answer, which hopefully still contains useful information.

  • Running locale in an interactive shell tells you what locale is in effect, reflected in the following environment variables: LANG, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME. For instance, if the US English locale is in effect, you'd see:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
  • By default, Terminal.app and other terminal programs such as iTerm preconfigure the locale for shells to match the user's locale as specified via System Preferences > Language & Region (in Terminal.app you can turn this behavior off via Preferences... > Settings > {Your Profile} > Advanced, checkbox Set locale environment variables on startup).

    • The character encoding - reflected in the .{encoding} suffix in the locale ID, typically .UTF-8 - will match the encoding configured in the terminal program's settings (for Terminal.app, go to Preferences... > Settings > {Your Profile} > Advanced and change the Character encoding setting), if supported (use locale -a to see all supported language/region + encoding combinations).

    • Both Terminal and iTerm default to UTF-8, which is a sensible choice.

    • If your terminal program is configured to use an unsupported character encoding, the locale ID reported will have NO encoding suffix (e.g., just en_US) in Terminal and revert to the "C" locale altogether in iTerm - and things will likely NOT work properly (Terminal will still let you print non-ASCII characters from that encoding, but the utilities won't recognize them as characters, resulting in illegal byte sequence errors).

    • Similarly, if you configure an unsupported combination of primary language and geographic region in System Preferences (e.g., combining "German" (de) with "United States" (US), which results in unsupported locale de_US), only LC_CTYPE will be matched to your terminal program's encoding, and the other LC_* categories will default to "C".

  • In case you need to set a locale manually, run either of the following:

    • export LANG={localeId}
    • export LC_ALL={localeId}

The difference is that export LANG=... provides a default for all LC_* categories while allowing you to selectively override them, whereas export LC_ALL=... overrides all LC_* categories.
Supported locale IDs can be listed with locale -a; it's best to choose one that is UTF-8-based, e.g., de_CH.UTF-8.
The POSIX locale - essentially an ASCII-only, US-English locale - can be selected either via "POSIX" or "C".

  • Caveat: ALL Unix utilities that come with macOS suffer from the problem described above: they do not recognize equivalent Unicode strings in NFC and NFD as identical. Aside from this issue, many, but not all, Unix utilities are UTF-8 multi-byte-character-aware in principle.
    • A notable exception as of macOS 10.14 - i.e., a utility that is not UTF8-aware at all - is awk; in earlier macOS versions sort wasn't UTF8-aware either (this changed when the previously used obsolete GNU implementation was replaced with a recent BSD implementation).
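
Regarding setting a locale manually, as described above: before exporting a locale ID, it can be worth checking that locale -a actually lists it. A small sketch along those lines; the function names locale_supported and use_locale_if_supported are made up, and the exact-match comparison assumes IDs spelled the way macOS lists them (e.g., de_CH.UTF-8).

```shell
# Succeed if the given locale ID appears verbatim in the output of `locale -a`.
locale_supported() {
  locale -a 2>/dev/null | grep -qxF -- "$1"
}

# Export LC_ALL only if the locale is supported; otherwise leave it alone.
use_locale_if_supported() {
  if locale_supported "$1"; then
    export LC_ALL="$1"
  else
    echo "locale '$1' is not supported here" >&2
    return 1
  fi
}

use_locale_if_supported "de_CH.UTF-8" || echo "keeping current locale settings"
```

Using LANG instead of LC_ALL in the export would keep the individual LC_* categories overridable.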
mklement0
  • @svenson: Thanks; I've since found the crux of the problem and have heavily revised my answer - it's a sordid tale. Note that your AppleScript-based solution only _prints_ the path in recognizable form - if you wanted to _compare_ it to a string, you'd run into the NFC/NFD problem again. The equivalent of your AppleScript solution - suffering the same limitation - is `pgrep -lf 'siMed'`. – mklement0 Apr 25 '14 at 20:41

You have to set up your locale settings to match the accents. Example:

$ export LC_ALL="en_US.UTF-8"
$ echo "amétiq siMed Büro.app" | grep ü

NO result

$ export LC_ALL="en_US"                                                                      
$ echo "amétiq siMed Büro.app" | grep ü
amétiq siMed Büro.app

ps example:

$ export LC_ALL="en_US"
$ tail -f ü.k &
[1] 57945
$ ps -ef | grep ü[.]
klashxx   57945 27535  0 15:02 pts/6    00:00:00 tail -f ü.k
Juan Diego Godoy Robles
  • this works fine with echo, but it returns no result when combining with ps or lsof, they seem to produce different output than echo. i need to filter running processes, are there maybe alternatives to using ps or lsof? – svenson Apr 22 '14 at 12:52
  • +1 for the answer and to avoid accidental metacharacters in pattern use `grep \`printf '%q' 'pattern'\`` – PradyJord Apr 22 '14 at 12:54
  • using /bin/ps and /usr/sbin/lsof (standard install OS X 10.9.2) – svenson Apr 22 '14 at 12:54
  • no success for me... `bash> jobs bash> export LC_ALL="en_US" ; touch "/tmp/amétiq siMed Büro.log" ; tail -f "/tmp/amétiq siMed Büro.log" & [1] 11066 bash> ps ax | grep ü[.] bash> ps -ef | grep ü[.] bash> jobs [1]+ Running tail -f "/tmp/amétiq siMed Büro.log" & bash> ` – svenson Apr 22 '14 at 13:11
  • sorry for the oneliner, apparently in a comment you cannot use newlines... (each `bash>` is supposed to start a new line) – svenson Apr 22 '14 at 13:13
  • @Jord i assume you mean something like `grep "\`printf '%q' 'amétiq siMed Büro.app'\`"`, but this returns no result for me. – svenson Apr 22 '14 at 13:22
  • @svenson Please use: `ps -ef | recode utf8..utf16 | grep \`printf '%q' 'amétiq siMed Büro.app'\`` See if this works – PradyJord Apr 22 '14 at 13:25
  • @Jord recode is not part of the standard OS X installation, do you know which command is the OS X equivalent? the code needs to run without any additional installation of tools/binaries. – svenson Apr 22 '14 at 13:29
  • At least the part about matching your locale settings is wrong. A UTF-8 locale should have everything. If I test with en_US.UTF-8 specifically I have no problems matching umlauts from ps output. – Andreas Bombe Apr 22 '14 at 13:41
  • (copied my answer from below) found the following statement in `man ps`: `The ps utility does not correctly display argument lists containing multibyte characters`. i guess that's where my problem is coming from... which again raises the question: are there other ways to filter running processes than using ps or lsof? – svenson Apr 22 '14 at 14:33

it seems i was too quick in solving my problem using osascript/AppleScript - i was able to filter my process in question in the terminal, but for some reason it didn't work in my script...

so here's what i found to work around the problem: if i cannot reliably "grep" the application path using commands like ps, lsof, ... matching the path my script gets as param, then i simply need to re-generate it with the help of a new process.

again, my problem in short:

my script gets an application path as parameter. this path contains umlauts. furthermore, there are several variants of the application, named differently, several of them might be running at the same time, but the script needs to filter exactly the one it gets as param.

/Applications/amétiq siMed Büro.app/Contents/MacOS/siMed2

using ps, lsof etc. i get garbled output, no matter what locale i had set, it never matched my param:

bash> ps ax | grep "/[A]pplications/am"
70202   ??  S      1:56.38 /Applications/ame?M^Atiq siMed Bu?M^Hro.app/Contents/MacOS/siMed2
75164   ??  U      0:01.75 /Applications/ame?M^Atiq siMed MASTER SN.app/Contents/MacOS/siMed2

grep fails as soon as there's an Umlaut involved in the string:

bash> ps ax | grep "/[A]pplications/amétiq siMed Büro.app"
(empty result)

my solution is to start a "tail &" process on a file existing in the application package, then do a bit of ps, cut and awk, to get the pid of the application i am looking for:

cd "/Applications/amétiq siMed Büro.app"  # path the script gets as param
tail -f ./Contents/MacOS/helperfile.txt &
helperpid=$!  # pid of tail process
# application path as lsof reports it, taken from the helper's open file
gr=$(lsof -p "$helperpid" | cut -d'/' -f 2- | grep '/Contents/MacOS/' | sed 's:/Contents/MacOS.*$::' | head -1)
kill "$helperpid"  # helper process no longer needed
finalpid=$(lsof | grep "$gr" | grep "app/Contents/MacOS" | awk '{print $2}')
# $finalpid contains the pid of the process in question
please note that i had to set LC_ALL and LANG to "en_US.UTF-8" (possibly setting one of them might not be required, i did not dig into this any further...).

i know this is only a workaround, it would be much nicer to have a oneliner... at least this solution does the trick for me. thanks again for anyone involved in the discussion of this problem!

svenson
  • please have a look at the awesome answer of mklement0 about NFC and NFD strings (composed/decomposed characters), it explains my problem in great detail. – svenson Apr 28 '14 at 09:20
  • Thanks for the plug; I think I've found a generic solution, using `iconv` with `UTF8-MAC`; see the top of my updated answer. – mklement0 May 11 '14 at 15:07