3

I want printf to recognize multi-byte characters when calculating the field width so that columns line up properly... I can't find an answer to this problem and was wondering if anyone here had any suggestions, or maybe a function/script that takes care of this problem.

Here's a quick and dirty example:

printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
>##           *       ##
>##         •       ##

Obviously, I want the result:

>##           *       ##
>##           •       ##

Any way to achieve this?

Aesthir

7 Answers

5

The best I can think of is:

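function formatwidth
{
  # Usage: formatwidth STRING WIDTH
  # Prints WIDTH plus the number of extra bytes in STRING's multibyte characters.
  local STR=$1 WIDTH=$2
  local BYTEWIDTH CHARWIDTH
  BYTEWIDTH=$( printf '%s' "$STR" | wc -c )   # length in bytes
  CHARWIDTH=$( printf '%s' "$STR" | wc -m )   # length in characters (locale-dependent)
  echo $(( WIDTH + BYTEWIDTH - CHARWIDTH ))
}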

printf "## %5s %*s %5s ##\n## %5s %*s %5s ##\n" \
    '' "$( formatwidth "*" 5 )" '*' '' \
    '' "$( formatwidth "•" 5 )" "•" ''

You use the * width specifier to take the width as an argument, and calculate the width you need by adding the number of additional bytes in multibyte characters.

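Note that in GNU wc, -c counts bytes and -m counts characters (which may be multibyte). The -m count depends on the current locale, so this approach needs a UTF-8 locale to be in effect.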
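As a rough sketch of the same width arithmetic without relying on wc -m and the active locale, the character count can be approximated by deleting UTF-8 continuation bytes (0x80 to 0xBF) and counting what is left. This assumes the input is valid UTF-8 and that every code point takes one column, which is not true of wide or combining characters. The names utf8_chars and pad_width are made up for illustration.

```shell
#!/bin/sh
# utf8_chars STRING: count UTF-8 code points by dropping continuation
# bytes (octal 200-277) and counting the bytes that remain.
utf8_chars() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c
}

# pad_width STRING WIDTH: the field width to pass to printf's %*s so that
# STRING is padded to WIDTH characters when printf itself counts bytes.
pad_width() {
    bytes=$(printf '%s' "$1" | wc -c)
    chars=$(utf8_chars "$1")
    echo $(( $2 + bytes - chars ))
}

printf '## %*s ##\n' "$(pad_width '*' 5)" '*'
printf '## %*s ##\n' "$(pad_width '•' 5)" '•'
```

For the bullet, pad_width adds the two extra bytes of the three-byte character to the requested width, which is the same adjustment formatwidth makes.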

ninjalj
2

I would probably use GNU awk:

awk 'BEGIN{ printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n", "", "*", "", "", "•", "" }'
##           *       ##
##           •       ##

You can even write a shell wrapper function called printf on top of awk to keep the same interface:

tr2awk() { 
    FMT="$1"
    echo -n "gawk 'BEGIN{ printf \"$FMT\""
    shift
    for ARG in "$@"
        do echo -n ", \"$ARG\""
    done
    echo " }'"
}

and then override printf with a simple function:

printf() { eval `tr2awk "$@"`; }

Test it:

# buggy printf binary test:
/usr/bin/printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##         •       ##
# buggy printf shell builtin test:
builtin printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##         •       ##

# fixed printf function test:
printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##           •       ##
Michał Šrajer
  • Hmmm... I tried this and get exactly the same result. Seems awk has the same problem. --edit-- Sorry, I didn't read your response properly... using `gawk` works with this simple example, but awk has problems with quoting. Suppose I wanted to write a script that is suitable for any `printf` line with quotes of all types interspersed. How would I keep my original single quotes in `awk`, or add more anywhere in the command? – Aesthir Jul 31 '11 at 04:32
  • The above works if you triple up the backslashes: `printf "## %5s '%5s' %5s ##\\\n## %5s '%5s' %5s ##\\\n" '' '*' '' '' "•" ''` But what if I wanted single quotes in there? like this: `printf "## %5s '%5s' %5s ##\\\n## %5s '%5s' %5s ##\\\n" '' '*' '' '' "•" ''`? How to use gawk and still print the single quotes? – Aesthir Jul 31 '11 at 17:32
  • tr2awk is vulnerable to injection through both $FMT and $ARG, since their values are not escaped properly and the result is fed into an eval. – OLEGSHA Aug 26 '22 at 14:12
2

A language like python will probably solve your problems in a simpler, more controllable way...

#!/usr/bin/env python3

import unicodedata

def width(string):
    # Wide (W) and fullwidth (F) characters occupy two terminal columns.
    return sum(1 + (unicodedata.east_asian_width(c) in "WF")
               for c in string)

a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i, j in zip(a1, a2):
    # Pad each first-column entry to 12 display columns.
    print('%s %s: %s' % (i, ' ' * (12 - width(i)), j))
Fredrik Pihl
  • The results of your example looks perfect! But I don't know the first thing about python... I wrote an **`fprintf` script above that handles arguments _exactly_ like printf**, but works with multi-byte characters... could you or someone else whip up a python script to do the same? It may be faster than mine... but so far mine is the only way I know. – Aesthir Jul 31 '11 at 17:42
0

This is kind of late, but I just came across this and thought I would post it for others who find the same question. A variation on @ninjalj's answer is to create a function that returns a string padded to a given length, rather than calculating the required format width:

#!/bin/bash
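function sized_string
{
        # Usage: sized_string STRING WIDTH
        local STR=$1 WIDTH=$2
        local BYTEWIDTH=$( printf '%s' "$STR" | wc -c )
        local CHARWIDTH=$( printf '%s' "$STR" | wc -m )
        local FMT_WIDTH=$(( WIDTH + BYTEWIDTH - CHARWIDTH ))
        # Quote "$STR" so strings containing spaces stay a single argument.
        printf "%*s" "$FMT_WIDTH" "$STR"
}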
printf "[%s]\n" "$(sized_string "abc" 20)"
printf "[%s]\n" "$(sized_string "ab•cd" 20)"

which outputs:

[                 abc]
[               ab•cd]
HardcoreHenry
0

A pure shell solution

right_justify() {
        # parameters: field_width string
        local spaces questions
        spaces=''
        questions=''
        while [ "${#questions}" -lt "$1" ]; do
                spaces=$spaces" "
                questions=$questions?
        done
        result=$spaces$2
        result=${result#"${result%$questions}"}
}

Note that this still does not work in dash because dash has no locale support.
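A minimal usage sketch, assuming the function exactly as written above: each call leaves the padded string in the global variable result, which is then copied before the next call. The helper print_row and the variables a, b and c are hypothetical names used only here. The bullet row lines up only when the shell's ? pattern matches a whole character, that is, in a UTF-8 locale with a locale-aware shell.

```shell
#!/bin/sh
# right_justify FIELD_WIDTH STRING, as in the answer: writes to $result.
right_justify() {
        local spaces questions
        spaces=''
        questions=''
        while [ "${#questions}" -lt "$1" ]; do
                spaces=$spaces" "
                questions=$questions?
        done
        result=$spaces$2
        result=${result#"${result%$questions}"}
}

# print_row A B C: hypothetical helper that pads three cells to width 5.
print_row() {
        right_justify 5 "$1"; a=$result
        right_justify 5 "$2"; b=$result
        right_justify 5 "$3"; c=$result
        printf '## %s %s %s ##\n' "$a" "$b" "$c"
}

print_row '' '*' ''
print_row '' '•' ''
```

Because no command substitution is involved, the padding is done entirely inside the current shell process.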

jilles
  • How would I call this function? What arguments would I use? And how does it help give the `printf` command Unicode support? – Aesthir Jul 31 '11 at 18:33
  • For example, a call `right_justify 10 abc` writes into the variable `result` a string of 10 characters containing 7 spaces and `abc` at the end (if `abc` were more than 10 characters, the last 10 characters of it). Although returning the result via a global variable is a little ugly, it is much faster than writing it to stdout and calling the function via command substitution. – jilles Aug 06 '11 at 21:38
0

Here's another solution with (g)awk:

function multibyte_printf {
    # Usage: multibyte_printf FORMAT [ARGUMENT...]
    local begin_rule='BEGIN { printf'
    local vars=() arg_index arg

    for (( arg_index=1; arg_index<=$#; arg_index++ )); do
        # Only the variable name goes into the awk code; the value is passed via -v.
        begin_rule+=" arg${arg_index},"
        arg="${!arg_index}"
        vars+=('-v' "arg${arg_index}=${arg}")
    done

    # Remove the trailing ','
    begin_rule="${begin_rule%,}"
    begin_rule+=' }'

    gawk "${vars[@]}" "$begin_rule"
}

It generates and executes commands like this:

gawk -v 'arg1=%10s' -v 'arg2=World' 'BEGIN { printf arg1, arg2 }'

The main advantage of this solution over @Michał Šrajer's is improved security. Using awk variables instead of baking parameters into the rule code eliminates the need to escape special characters. It should be impossible to tamper with execution using malformed arguments.
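One caveat worth knowing: awk interprets backslash escapes in values assigned with -v, so an argument such as 'a\tb' reaches printf with a real tab in it. That is what makes \n work in the format string, but it also alters data arguments. The following is a hedged sketch of a variant that keeps -v for the format only and reads the remaining arguments from awk's ARGV array, which is not escape-processed. The name awk_printf is made up for illustration, and the sketch calls plain awk; character-aware field widths still need gawk in a UTF-8 locale, as in the answer.

```shell
#!/bin/sh
# awk_printf FORMAT [ARGUMENT...]
# The format is passed with -v (so \n and friends are interpreted);
# data arguments are read from ARGV and are never parsed as awk code.
awk_printf() {
    fmt=$1
    shift
    rule='BEGIN { printf fmt'
    i=1
    while [ "$i" -le "$#" ]; do
        # Only the numeric index is spliced into the program text.
        rule="$rule, ARGV[$i]"
        i=$((i + 1))
    done
    rule="$rule }"
    # A BEGIN-only program exits before awk tries to open the operands
    # as input files, so they serve purely as data.
    awk -v "fmt=$fmt" "$rule" "$@"
}

awk_printf '[%5s|%-3s]\n' 'ab' 'c'
awk_printf '%s\n' 'a\tb'
```

Swapping awk for gawk in the last line of the function should give the multibyte-aware behaviour described above.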

OLEGSHA
-1

Are these the only ways? Is there no way to do it with printf alone?

Well with the example from ninjalj (thx btw), I wrote a script to deal with this problem, and saved it as fprintf in /usr/local/bin:

#! /bin/bash

IFS=' '
declare -a Text=("${@}")

## Skip the whole thing if there are no multi-byte characters ##
if (( $(echo "${Text[*]}" | wc -c) > $(echo "${Text[*]}" | wc -m) )); then
    if echo "${Text[*]}" | grep -Eq '%[#0 +-]?[0-9]+(\.[0-9]+)?[sb]'; then
        IFS=$'\n'
        declare -a FormatStrings=($(echo -n "${Text[0]}" | grep -Eo '%[^%]*?[bs]'))
        IFS=$' \t\n'
        declare -i format=0

    ## Check every format string ##
        for fw in "${FormatStrings[@]}"; do
            (( format++ ))
            if [[ "$fw" =~ ^%[#0\ +-]?[1-9][0-9]*(\.[1-9][0-9]*)?[sb]$ ]]; then
                (( Difference = $(echo "${Text[format]}" | wc -c) - $(echo "${Text[format]}" | wc -m) ))

            ## If multi-byte characters ##
                if (( Difference > 0 )); then

                ## If a field width is entered then replace field width value ##
                    if [[ "$fw" =~ ^%[#0\ +-]?[1-9][0-9]* ]]; then
                        (( Width = $(echo -n "$fw" | gsed -re 's|^%[#0 +-]?([1-9][0-9]*).*[bs]|\1|') + Difference ))
                        declare -a Text[0]="$(echo -n "${Text[0]}" | gsed -rne '1h;1!H;${g;y|\n|\x1C|;s|(%[^%])|\n\1|g;p}' | gsed -rne $(( format + 1 ))'s|^(%[#0 +-]?)[1-9][0-9]*|\1'${Width}'|;1h;1!H;${g;s|\n||g;y|\x1C|\n|;p}')"
                    fi

                ## If a precision is entered then replace precision value ##
                    if [[ "$fw" =~ \.[1-9][0-9]*[sb]$ ]]; then
                        (( Precision = $(echo -n "$fw" | gsed -re 's|^%.*\.([1-9][0-9]*)[sb]$|\1|') + Difference ))
                        declare -a Text[0]="$(echo -n "${Text[0]}" | gsed -rne '1h;1!H;${g;y|\n|\x1C|;s|(%[^%])|\n\1|g;p}' | gsed -rne $(( format + 1 ))'s|^(%[#0 +-]?([1-9][0-9]*)?)\.[1-9][0-9]*([bs])|\1.'${Precision}'\3|;1h;1!H;${g;s|\n||g;y|\x1C|\n|;p}')"
                    fi
                fi
            fi
        done
    fi
fi

printf "${Text[@]}"
exit 0

Usage: fprintf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' '•' ''

A few things to note:

  • I didn't write this script to deal with * (asterisk) values for formats because I never use them. I wrote this for me and didn't want to over-complicate things.
  • I wrote this to check only the format strings %s and %b, as they seem to be the only ones affected by this problem. Thus, if someone somehow manages to get a multi-byte Unicode character out of a number, it may not work without minor modification.
  • The script works well for basic uses of printf (I'm no old-school UNIX hacker). Feel free to modify it or use it as is!
tripleee
Aesthir