10

A POSIX-compliant shell provides mechanisms like this to iterate over collections of strings:

for x in $(seq 1 5); do
    echo $x
done

But how do I iterate over each character of a word?

codeforester
Luis Lavaire.
  • (As an aside, `seq` isn't POSIX-specified; one mechanism for counting to 5 POSIX-ly might be `i=0; while [ "$i" -lt 5 ]; do echo "$i"; i=$((i + 1)); done`) – Charles Duffy Jun 26 '18 at 23:38
  • I was trying to demonstrate how to perform an iteration. The `seq` command is not part of the mechanism that performs the iteration. But you are right about your example being POSIX compliant. – Luis Lavaire. Jun 26 '18 at 23:54

4 Answers

10

It's a little circuitous, but I think this will work in any POSIX-compliant shell. I've tried it in dash, but I don't have BusyBox handy to test with.

var='ab * cd'

tmp="$var"    # The loop will consume the variable, so make a temp copy first
while [ -n "$tmp" ]; do
    rest="${tmp#?}"    # All but the first character of the string
    first="${tmp%"$rest"}"    # Remove $rest, and you're left with the first character
    echo "$first"
    tmp="$rest"
done

Output:

a
b

*

c
d

Note that the double-quotes around the right-hand side of assignments are not needed; I just prefer to use double-quotes around all expansions rather than trying to keep track of where it's safe to leave them off. On the other hand, the double-quotes in [ -n "$tmp" ] are absolutely necessary, and the inner double-quotes in first="${tmp%"$rest"}" are needed if the string contains "*".
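
To make the point about the inner double-quotes concrete, here is a minimal sketch. The helper names `first_quoted` and `first_unquoted` are hypothetical and exist only for this comparison; they are not part of the answer's code. With the inner quotes missing, a `*` inside `$rest` is treated as a glob pattern, so the wrong suffix is removed.

```shell
#!/bin/sh
# Hypothetical helpers: extract the first character of $1 with and
# without the inner double-quotes around $rest.

first_quoted() {
    rest="${1#?}"
    printf '%s\n' "${1%"$rest"}"    # $rest is matched literally
}

first_unquoted() {
    rest="${1#?}"
    printf '%s\n' "${1%$rest}"      # $rest is treated as a glob pattern
}

first_quoted 'a*b'      # prints: a
first_unquoted 'a*b'    # prints: a*  (the pattern *b matched only "b")
```

In the unquoted case the pattern `*b` matches the shortest suffix it can, which is just `b`, so two characters are left instead of one.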

Gordon Davisson
  • Seems to choke on multibyte characters in busybox sh and dash, but not in bash or ksh. – that other guy Jun 27 '18 at 00:33
  • `LC_CTYPE=C` to the rescue? – Charles Duffy Jun 27 '18 at 00:49
  • dash under Ubuntu seems to process the string byte-by-byte (rather than character-by-character) no matter what `LC_CTYPE` is set to; I'm guessing that ash under busybox does the same. If you need proper character-by-character processing, this'll make things more complicated. – Gordon Davisson Jun 27 '18 at 01:19
  • I think that for ASCII characters it will be fine. For unicode, I don't think so. – Luis Lavaire. Jun 28 '18 at 17:45
  • @LuisLavaire Do you need proper Unicode support for your application? If so, I don't think there's an easy way to build it in a shell that doesn't do Unicode itself; it'd depend on using some utility that's available and properly supports Unicode across every platform you need it to work on. – Gordon Davisson Jun 28 '18 at 19:38
  • No, @GordonDavisson. I just need to manipulate ASCII characters, mostly letters, so your example will work. Thanks. – Luis Lavaire. Jun 29 '18 at 19:54
  • `first="${tmp%"$rest"}"` looks slow. alternative: `first="$(echo "$tmp" | head -c1)"` – milahu Sep 06 '22 at 10:33
  • @milahu That alternative will actually be a lot slower, because it needs to create new subprocesses to run the `echo` and `head` programs, and creating processes is computationally expensive. `first="${tmp%"$rest"}"` does everything in the shell process, so it actually winds up much faster. Try them and see for yourself! – Gordon Davisson Sep 06 '22 at 17:01
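
To check which behavior a given shell has, a small probe along these lines may help. This is only a sketch: `split_mode` is a hypothetical name, and the character-wise result assumes the shell is running in a UTF-8 locale.

```shell
#!/bin/sh
# Hypothetical probe: does "?" in a parameter expansion match one byte
# or one character in this shell?

split_mode() {
    sample=$(printf '\303\251')    # the two UTF-8 bytes of "é"
    if [ -z "${sample#?}" ]; then
        echo character-wise        # "?" consumed both bytes
    else
        echo byte-wise             # one byte is still left over
    fi
}

split_mode
```

If the probe reports byte-wise, the loop above will emit each byte of a multibyte character on its own line.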
3

Use getopts to process the input one character at a time. The leading : in the option string puts getopts into silent mode: instead of printing an error for each unrecognized option, it sets OPTARG to the offending character. The leading - in the input makes getopts treat the string as a cluster of options.

If getopts encounters a colon in the input, it will not set OPTARG, so the script uses the ${OPTARG:-:} parameter expansion to print : when OPTARG is unset or null.

#!/bin/sh
IFS='
'
split_string () {
  OPTIND=1;
  while getopts ":" opt "-$1"
    do echo "'${OPTARG:-:}'"
  done
}

while read -r line;do
  split_string "$line"
done
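
As a rough illustration of what getopts reports on each pass in silent mode, the sketch below prints both variables for every character. The function name `inspect_chars` is hypothetical, and the output shown is what I would expect from bash and dash; other shells may differ.

```shell
#!/bin/sh
# Hypothetical helper: show the option variable and OPTARG for each
# character of $1 when no options are declared valid.

inspect_chars() {
    OPTIND=1    # restart option parsing on every call
    while getopts ":" opt "-$1"; do
        printf 'opt=%s OPTARG=%s\n' "$opt" "${OPTARG:-:}"
    done
}

inspect_chars 'ab'
# opt=? OPTARG=a
# opt=? OPTARG=b
```

Every character is "unrecognized", so the option variable is always `?` and the character itself arrives in OPTARG.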

As with the accepted answer, this processes strings byte-wise instead of character-wise, corrupting multibyte codepoints. The trick is to detect multibyte codepoints, concatenate their bytes and then print them:

#!/bin/sh
IFS='
'
split_string () {
  OPTIND=1;
  while getopts ":" opt "$1";do
    case "${OPTARG:=:}" in
      ([[:print:]])
        [ -n "$multi" ] && echo "$multi" && multi=
        echo "$OPTARG" && continue
    esac
    multi="$multi$OPTARG"
    case "$multi" in
      ([[:print:]]) echo "$multi" && multi=
    esac
  done
  [ -n "$multi" ] && echo "$multi"
}
while read -r line;do
  split_string "-$line"
done

Here the extra case "$multi" is used to detect when the multi buffer contains a printable character. This works in shells such as Bash and Zsh, but Dash and BusyBox ash ignore the locale and do not pattern-match multibyte codepoints.

This degrades somewhat nicely: Dash/ash treat sequences of multibyte codepoints as one character, but handle multibyte characters surrounded by single byte characters fine.

Depending on your requirements it may be preferable not to split consecutive multibyte codepoints anyway, as the next codepoint may be a combining character which modifies the character before it.

This won't handle the case where a single byte character is followed by a combining character.
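
For shells whose pattern matching ignores the locale, the bytes can instead be grouped numerically. The sketch below is a different technique from the getopts approach, not a variant of it: it leans on the external `od` and `printf` utilities, assumes valid UTF-8 input and the default IFS, and uses the hypothetical name `split_utf8`. Like the scripts above, it does not attach combining characters to their base character.

```shell
#!/bin/sh
# Hypothetical helper: print each UTF-8 code point of $1 on its own line.
# Assumes valid UTF-8 input and the default IFS (space, tab, newline).
# Note: "first" and "byte" are global, since POSIX sh has no local.

split_utf8() {
    first=1
    for byte in $(printf '%s' "$1" | od -An -v -tu1); do
        # Bytes 128..191 (10xxxxxx) continue the current code point;
        # any other value starts a new one.
        if [ "$byte" -lt 128 ] || [ "$byte" -gt 191 ]; then
            [ "$first" -eq 1 ] || printf '\n'
            first=0
        fi
        # Re-emit the byte from its octal value.
        printf '%b' "\\0$(printf '%o' "$byte")"
    done
    [ "$first" -eq 1 ] || printf '\n'
}

split_utf8 "$(printf 'na\303\257ve')"    # n, a, ï, v, e on separate lines
```

Because it starts two extra processes per byte, this is likely to be much slower than the pure-shell versions and is best kept for short strings.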

David Farrell
  • Interesting. It may improve performance problems with large strings. I don't know if it will work, but I've tried to improve on your idea. `split_string() { OPTIND=1; while getopts ":" opt "-$1"; do echo "'$OPTARG'"; done; }` – Koichi Nakashima Nov 10 '21 at 17:22
  • @KoichiNakashima very nice! On my laptop it's 1.4x faster at parsing 312Kb text file, and 55x faster than using `sed` – David Farrell Nov 11 '21 at 00:51
2

This works in dash and busybox:

echo 'ab * cd' | grep -o .

Output:

a
b

*

c
d
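
To act on each character rather than just print it, the pipeline can feed a `while read` loop. This is a sketch with a hypothetical function name, `bracket_chars`; as far as I know, `-o` is not among the options POSIX specifies for grep, so it assumes a grep that supports it (GNU, BSD and BusyBox grep do).

```shell
#!/bin/sh
# Hypothetical helper: wrap each character of $1 in brackets.
# Assumes grep supports the -o option.

bracket_chars() {
    printf '%s\n' "$1" | grep -o . | while IFS= read -r ch; do
        # IFS= and -r keep spaces and backslashes intact
        printf '[%s]\n' "$ch"
    done
}

bracket_chars 'a *'
# [a]
# [ ]
# [*]
```

The loop body runs in a subshell in most shells, so variables set inside it are not visible after the pipeline ends.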
agc
-1

I was developing a script that needed stacks, and the same pop operation can be used to iterate through a string:

#!/bin/sh
# posix script

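pop () {
#    $1 name of the variable that receives the top (first) character
#    $2 name of the variable holding the stack (the string)
#    The X prefix keeps expr from mistaking the string for one of its operators.
    eval "$1=\$(expr \"X\$$2\" : 'X\\(.\\)')"
    eval "$2=\$(expr \"X\$$2\" : 'X.\\(.*\\)')"
}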

string="ABCDEFG"
while [ "$string" != "" ]
do
    pop c string
    echo "--" "$c"
done
macemurez