I'm trying to wrap pieces of text to a fixed number of columns. The text is meant for logging and may contain user data, so I cannot assume anything about the input. For ease of viewing, I want to make sure that the text is only a certain fixed number of characters wide when viewed in a standard Linux console.

PHP does not have a multibyte equivalent of its wordwrap function, although several frameworks ship their own version. I've been trying out several multibyte implementations, among them the answers to this question, in order to wrap some UTF-8 text for Unicode display. Most of the answers use mb_strlen to calculate substring lengths.

To ensure that the file I wanted to display is recognized as UTF-8, I prepended a byte order mark to the text. As far as I know, this should cause Linux to recognize and format it correctly.

Then I tried to wrap some long strings. Strings such as 'ëëëëëëëë...' are correctly cut off after the specified N character positions. However, a string such as '₩₩₩₩₩₩₩₩₩...' is not. It contains characters that are double width in the Linux console, and it is cut off later than it is supposed to be. In fact, exactly twice as many columns as expected are displayed in this case, so it seems that some or all of the double-width characters are counted as width 1. The internal encoding is correctly set to UTF-8, using:

mb_internal_encoding("UTF-8");

Note that when I set mb_internal_encoding to some other value, I get other weird results: strings are then cut off about 50% too early, so a string of 70 characters gets cut off at position 26. The cuts happen in the middle of characters, resulting in mojibake.
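The mid-character cuts are what one would expect when UTF-8 data is handled as a single-byte encoding. A minimal sketch of that effect, assuming the mbstring extension is loaded; it passes the encoding explicitly to mb_substr instead of relying on mb_internal_encoding, and the sample string is made up for illustration:

```php
<?php
// "ë" (U+00EB) is two bytes in UTF-8 (0xC3 0xAB), so four of them take eight bytes.
$text = str_repeat("\u{00EB}", 4);

// Treated as UTF-8, the cut lands on a character boundary.
$good = mb_substr($text, 0, 3, 'UTF-8');

// Treated as a single-byte encoding, the cut lands after three *bytes*,
// which splits the second "ë" in half.
$bad = mb_substr($text, 0, 3, 'ISO-8859-1');

var_dump(strlen($good));                      // int(6): three whole characters
var_dump(mb_check_encoding($good, 'UTF-8'));  // bool(true)
var_dump(strlen($bad));                       // int(3)
var_dump(mb_check_encoding($bad, 'UTF-8'));   // bool(false): truncated sequence
```

The second result is the kind of broken byte sequence a console renders as mojibake or a replacement character.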

My code for the line breaks in the class that does the work looks like this, with mb_wordwrap redirecting to one of the example functions:

private function BreakIntoLines($text) {                
    $dtxt = $this->mb_wordwrap($text, self::LINE_LENGTH, PHP_EOL, true);
    return explode(PHP_EOL, $dtxt);
}

File writing then happens using the classic C-style file functions; the following snippet writes one of the lines:

fwrite($this->file, $line);
fwrite($this->file, PHP_EOL);

I verified the output using a wide range of Unicode characters and they all seem to display correctly. I am completely lost as to the problem; what is going on here?

Note: Writing the algorithm ourselves seems like an inordinate amount of work. There also appear to be combining characters. For example, the sequence \u0068\u0300 occupies only one character cell in a terminal. Experimenting, the code:

$str = json_decode("\"\\u0068\\u0300\"");
var_dump($str); 
echo mb_strlen($str);

prints out: string(3) "h̀" 2

aphid

1 Answer

Use the grapheme functions

There is a nice and clean solution to part of this question. Use the mb_wordwrap function from the linked question, but replace the mb_* functions with their grapheme_* counterparts. The grapheme functions are part of PHP's 'intl' extension and count user-perceived characters (grapheme clusters) rather than code points, so a base letter plus its combining marks counts as one. Also see this page for more information.

Thus, we use grapheme_strlen instead of mb_strlen, grapheme_substr instead of mb_substr, and so on. Inside the word-wrapping function we can then think of the string as a sequence of graphemes, each of which is internally a plain PHP byte string of N bytes. The mb_* functions work the same way (with code points instead of graphemes), so the code retains essentially the same structure.

For example:

$str = json_decode("\"\\u0068\\u0300\"");
var_dump($str);         
echo grapheme_strlen($str);

will output: string(3) "h̀" 1.
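To illustrate what "thinking in graphemes" looks like in practice, here is a minimal sketch that hard-cuts a string every N grapheme clusters. The helper name grapheme_chunks is hypothetical (it is not part of intl or of the linked question), and the code assumes the intl extension is loaded:

```php
<?php
// Hypothetical helper: split a UTF-8 string into pieces of at most $n
// grapheme clusters each. Requires ext-intl.
function grapheme_chunks(string $s, int $n): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1');
    }
    // grapheme_strlen() does not return an int for invalid input; treat that as 0.
    $total = (int) grapheme_strlen($s);
    $chunks = [];
    for ($i = 0; $i < $total; $i += $n) {
        // Clamp the length so we never ask for more graphemes than remain.
        $chunks[] = grapheme_substr($s, $i, min($n, $total - $i));
    }
    return $chunks;
}

// "h" + U+0300 and "e" + U+0301 are one grapheme each, so this is 5 graphemes.
$parts = grapheme_chunks("h\u{0300}e\u{0301}llo", 2);
var_dump(count($parts)); // int(3)
```

Note that this only keeps combining sequences together; it says nothing yet about how many columns each piece occupies.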

grapheme_strlen still counts a double-width character as a single unit, so it does not give the display width on its own. To fix the problem completely, we need an actual width implementation. There is a C implementation that uses a list of ranges for double-width characters and a longer table, searched by binary search, for all the zero-width characters; it is reimplemented in PHP here.

The code below should, in theory, still word-wrap normal text correctly while also supporting the stranger UTF-8 characters. It does not support languages with other wrapping rules (wrapping at non-space characters) and probably does not fully support unusual whitespace (whitespace outside the ASCII range). However, with $cut enabled, this code should guarantee that no line is wider than $width columns.
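For the double-width half of the problem alone, the mbstring extension also ships mb_strwidth, which counts fullwidth characters as two columns. It may be worth trying before porting a width table, although I would not assume it treats combining marks as zero width; that is something to verify on your own PHP version. A small sketch with a made-up sample string:

```php
<?php
// Two CJK ideographs (U+4E16 U+754C), each three bytes in UTF-8
// and two columns wide in a terminal.
$cjk = "\u{4E16}\u{754C}";

var_dump(strlen($cjk));                 // int(6): bytes
var_dump(mb_strlen($cjk, 'UTF-8'));     // int(2): code points
var_dump(mb_strwidth($cjk, 'UTF-8'));   // int(4): display columns
var_dump(mb_strwidth("abc", 'UTF-8'));  // int(3)
```

The three numbers for the same string show why a wrapper built on mb_strlen lets such lines run to twice the intended width.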

/**
 * Word-wrap a multi-byte character (UTF) string
 * @param string $string Initial string. 
 * @param int $width Maximum width of a line
 * @param string $break Which character(s) to line break with
 * @param bool $cut Whether to force-chop long words. 
 * @return string The chopped string. 
 */
private function mb_wordwrap($string, $width = 75, $break = "\n", $cut = false) {
    $string = (string) $string;
    if ($string === '') {
      return '';
    }
    $break = (string) $break;
    if ($break === '') {
      trigger_error('Break string cannot be empty', E_USER_ERROR);
    }
    $width = (int) $width;
    if ($width === 0 && $cut) {
      trigger_error('Cannot force cut when width is zero', E_USER_ERROR);
    }
    if (mb_check_encoding($string, 'ASCII')) {
      return wordwrap($string, $width, $break, $cut);
    }
    $result = '';
    // Display width (columns) and byte length are tracked separately.
    // note: stringLength != stringWidth!
    $breakLength = strlen($break);                 // bytes
    $breakWidth = $this->trueStringWidth($break);  // columns
    $breakGraphemes = (int) grapheme_strlen($break);
    // Byte offset and column of the current line start and of the last space.
    $lastStartLength = $lastSpaceLength = 0;
    $lastStartWidth = $lastSpaceWidth = 0;
    // Byte length and width of the last whitespace grapheme seen.
    $spaceLength = $spaceWidth = 0;
    $G_sz = grapheme_strlen($string);
    $wPos = 0; // columns consumed before the current grapheme
    $lPos = 0; // bytes consumed before the current grapheme
    // Iterate over graphemes
    // Measure using trueStringWidth
    // Cut using byte offsets (for speed)
    for ($i = 0; $i < $G_sz; ++$i) {
        // If we already have a line break, preserve it and start anew
        if (grapheme_substr($string, $i, $breakGraphemes) === $break) {
            $lPos += $breakLength;
            $wPos += $breakWidth;
            $result .= substr($string, $lastStartLength, $lPos - $lastStartLength);
            $i += $breakGraphemes - 1;
            $lastStartLength = $lastSpaceLength = $lPos;
            $lastStartWidth = $lastSpaceWidth = $wPos;
            continue;
        }
        $char = grapheme_substr($string, $i, 1);
        $charLength = strlen($char);
        $charWidth = $this->trueStringWidth($char);
        if (preg_match('/\h/u', $char)) {
            // Horizontal whitespace: break here if the line is already full.
            // The space itself is excluded, so no look-ahead is used.
            if ($wPos - $lastStartWidth >= $width) {
                $result .= substr($string, $lastStartLength, $lPos - $lastStartLength) . $break;
                $lastStartLength = $lPos + $charLength;
                $lastStartWidth = $wPos + $charWidth;
            }
            $lastSpaceLength = $lPos;
            $lastSpaceWidth = $wPos;
            $spaceLength = $charLength;
            $spaceWidth = $charWidth;
        } elseif ($wPos + $charWidth - $lastStartWidth > $width) {
            // This grapheme would overflow the current line.
            if ($lastStartLength < $lastSpaceLength) {
                // Wrap at the last space; the next line starts right after it.
                $result .= substr($string, $lastStartLength, $lastSpaceLength - $lastStartLength) . $break;
                $lastStartLength = $lastSpaceLength = $lastSpaceLength + $spaceLength;
                $lastStartWidth = $lastSpaceWidth = $lastSpaceWidth + $spaceWidth;
            } elseif ($cut && $lPos > $lastStartLength) {
                // No space on this line: force-cut before this grapheme.
                $result .= substr($string, $lastStartLength, $lPos - $lastStartLength) . $break;
                $lastStartLength = $lastSpaceLength = $lPos;
                $lastStartWidth = $lastSpaceWidth = $wPos;
            }
        }
        // Always advance, whichever branch was taken.
        $wPos += $charWidth;
        $lPos += $charLength;
    }
    if ($lastStartLength !== $lPos) {
        $result .= substr($string, $lastStartLength, $lPos - $lastStartLength);
    }
    return $result;
}

private function trueStringWidth($str) {
    $w = 0;
    for($i = 0; $i < mb_strlen($str); ++$i) {
        $char = mb_substr($str, $i, 1);
        $w += $this->trueCharWidth($char);
    }
    return $w;
}

private function trueCharWidth($char) {
    $ucs = $this->uniord($char);
    // For non-unicode characters, return 1. 
    // Consoles replace them with 'replacement characters' which have width 1!
    if($ucs === FALSE) {return 1;}
    
    // Do some bit math...
    
    $combi = [
        [ 0x0300, 0x036F ], [ 0x0483, 0x0486 ], [ 0x0488, 0x0489 ],
        [ 0x0591, 0x05BD ], [ 0x05BF, 0x05BF ], [ 0x05C1, 0x05C2 ],
        [ 0x05C4, 0x05C5 ], [ 0x05C7, 0x05C7 ], [ 0x0600, 0x0603 ],
        [ 0x0610, 0x0615 ], [ 0x064B, 0x065E ], [ 0x0670, 0x0670 ],
        [ 0x06D6, 0x06E4 ], [ 0x06E7, 0x06E8 ], [ 0x06EA, 0x06ED ],
        [ 0x070F, 0x070F ], [ 0x0711, 0x0711 ], [ 0x0730, 0x074A ],
        [ 0x07A6, 0x07B0 ], [ 0x07EB, 0x07F3 ], [ 0x0901, 0x0902 ],
        [ 0x093C, 0x093C ], [ 0x0941, 0x0948 ], [ 0x094D, 0x094D ],
        [ 0x0951, 0x0954 ], [ 0x0962, 0x0963 ], [ 0x0981, 0x0981 ],
        [ 0x09BC, 0x09BC ], [ 0x09C1, 0x09C4 ], [ 0x09CD, 0x09CD ],
        [ 0x09E2, 0x09E3 ], [ 0x0A01, 0x0A02 ], [ 0x0A3C, 0x0A3C ],
        [ 0x0A41, 0x0A42 ], [ 0x0A47, 0x0A48 ], [ 0x0A4B, 0x0A4D ],
        [ 0x0A70, 0x0A71 ], [ 0x0A81, 0x0A82 ], [ 0x0ABC, 0x0ABC ],
        [ 0x0AC1, 0x0AC5 ], [ 0x0AC7, 0x0AC8 ], [ 0x0ACD, 0x0ACD ],
        [ 0x0AE2, 0x0AE3 ], [ 0x0B01, 0x0B01 ], [ 0x0B3C, 0x0B3C ],
        [ 0x0B3F, 0x0B3F ], [ 0x0B41, 0x0B43 ], [ 0x0B4D, 0x0B4D ],
        [ 0x0B56, 0x0B56 ], [ 0x0B82, 0x0B82 ], [ 0x0BC0, 0x0BC0 ],
        [ 0x0BCD, 0x0BCD ], [ 0x0C3E, 0x0C40 ], [ 0x0C46, 0x0C48 ],
        [ 0x0C4A, 0x0C4D ], [ 0x0C55, 0x0C56 ], [ 0x0CBC, 0x0CBC ],
        [ 0x0CBF, 0x0CBF ], [ 0x0CC6, 0x0CC6 ], [ 0x0CCC, 0x0CCD ],
        [ 0x0CE2, 0x0CE3 ], [ 0x0D41, 0x0D43 ], [ 0x0D4D, 0x0D4D ],
        [ 0x0DCA, 0x0DCA ], [ 0x0DD2, 0x0DD4 ], [ 0x0DD6, 0x0DD6 ],
        [ 0x0E31, 0x0E31 ], [ 0x0E34, 0x0E3A ], [ 0x0E47, 0x0E4E ],
        [ 0x0EB1, 0x0EB1 ], [ 0x0EB4, 0x0EB9 ], [ 0x0EBB, 0x0EBC ],
        [ 0x0EC8, 0x0ECD ], [ 0x0F18, 0x0F19 ], [ 0x0F35, 0x0F35 ],
        [ 0x0F37, 0x0F37 ], [ 0x0F39, 0x0F39 ], [ 0x0F71, 0x0F7E ],
        [ 0x0F80, 0x0F84 ], [ 0x0F86, 0x0F87 ], [ 0x0F90, 0x0F97 ],
        [ 0x0F99, 0x0FBC ], [ 0x0FC6, 0x0FC6 ], [ 0x102D, 0x1030 ],
        [ 0x1032, 0x1032 ], [ 0x1036, 0x1037 ], [ 0x1039, 0x1039 ],
        [ 0x1058, 0x1059 ], [ 0x1160, 0x11FF ], [ 0x135F, 0x135F ],
        [ 0x1712, 0x1714 ], [ 0x1732, 0x1734 ], [ 0x1752, 0x1753 ],
        [ 0x1772, 0x1773 ], [ 0x17B4, 0x17B5 ], [ 0x17B7, 0x17BD ],
        [ 0x17C6, 0x17C6 ], [ 0x17C9, 0x17D3 ], [ 0x17DD, 0x17DD ],
        [ 0x180B, 0x180D ], [ 0x18A9, 0x18A9 ], [ 0x1920, 0x1922 ],
        [ 0x1927, 0x1928 ], [ 0x1932, 0x1932 ], [ 0x1939, 0x193B ],
        [ 0x1A17, 0x1A18 ], [ 0x1B00, 0x1B03 ], [ 0x1B34, 0x1B34 ],
        [ 0x1B36, 0x1B3A ], [ 0x1B3C, 0x1B3C ], [ 0x1B42, 0x1B42 ],
        [ 0x1B6B, 0x1B73 ], [ 0x1DC0, 0x1DCA ], [ 0x1DFE, 0x1DFF ],
        [ 0x200B, 0x200F ], [ 0x202A, 0x202E ], [ 0x2060, 0x2063 ],            
        [ 0x206A, 0x206F ], [ 0x20D0, 0x20EF ], [ 0x302A, 0x302F ],
        [ 0x3099, 0x309A ], [ 0xA806, 0xA806 ], [ 0xA80B, 0xA80B ],
        [ 0xA825, 0xA826 ], [ 0xFB1E, 0xFB1E ], [ 0xFE00, 0xFE0F ],
        [ 0xFE20, 0xFE23 ], [ 0xFEFF, 0xFEFF ], [ 0xFFF9, 0xFFFB ],
        [ 0x10A01, 0x10A03 ], [ 0x10A05, 0x10A06 ], [ 0x10A0C, 0x10A0F ],
        [ 0x10A38, 0x10A3A ], [ 0x10A3F, 0x10A3F ], [ 0x1D167, 0x1D169 ],
        [ 0x1D173, 0x1D182 ], [ 0x1D185, 0x1D18B ], [ 0x1D1AA, 0x1D1AD ],
        [ 0x1D242, 0x1D244 ], [ 0xE0001, 0xE0001 ], [ 0xE0020, 0xE007F ],
        [ 0xE0100, 0xE01EF ]
      ];

    /* test for 8-bit control characters */
    if ($ucs === 0) {
        return 0;
    }
    if ($ucs < 32 || ($ucs >= 0x7f && $ucs < 0xa0)) {
        return 0;
    }

    /* binary search in table of non-spacing characters */
    if ($this->binaryIntervalSearch($combi, $ucs)) {
        return 0;
    }

    /* if we arrive here, ucs is not a combining or C0/C1 control character */

    return 1 + (int)
        ($ucs >= 0x1100 &&
         ($ucs <= 0x115f ||                     /* Hangul Jamo init. consonants */
          $ucs == 0x2329 || $ucs == 0x232a ||
          ($ucs >= 0x2e80 && $ucs <= 0xa4cf &&
           $ucs != 0x303f) ||                   /* CJK ... Yi */
          ($ucs >= 0xac00 && $ucs <= 0xd7a3) || /* Hangul Syllables */
          ($ucs >= 0xf900 && $ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
          ($ucs >= 0xfe10 && $ucs <= 0xfe19) || /* Vertical forms */
          ($ucs >= 0xfe30 && $ucs <= 0xfe6f) || /* CJK Compatibility Forms */
          ($ucs >= 0xff00 && $ucs <= 0xff60) || /* Fullwidth Forms */
          ($ucs >= 0xffe0 && $ucs <= 0xffe6) ||
          ($ucs >= 0x20000 && $ucs <= 0x2fffd) ||
          ($ucs >= 0x30000 && $ucs <= 0x3fffd)));
}

private function uniord($c) {
    // Decode the first UTF-8 sequence in $c into its code point.
    // $c[0] replaces the $c{0} offset syntax, which PHP 8 no longer accepts.
    $b = ord($c[0]);
    if ($b <= 127) {
        return $b;
    }
    if ($b >= 192 && $b <= 223) {
        return ($b-192)*64 + (ord($c[1])-128);
    }
    if ($b >= 224 && $b <= 239) {
        return ($b-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    }
    if ($b >= 240 && $b <= 247) {
        return ($b-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64
                + (ord($c[3])-128);
    }
    // 5- and 6-byte forms are not valid UTF-8 (RFC 3629); kept from the original.
    if ($b >= 248 && $b <= 251) {
        return ($b-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096
                + (ord($c[3])-128)*64 + (ord($c[4])-128);
    }
    if ($b >= 252 && $b <= 253) {
        return ($b-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144
                + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    }
    if ($b >= 254) {    //  error
        return FALSE;
    }
    return 0;   //  stray continuation byte (128-191)
}   //  function uniord()



// It is assumed the interval array is sorted!
// It is assumed we have a SIMPLE array (indexed 0, 1, 2, ...). 
private function binaryIntervalSearch($array, $element) {
    if(count($array) === 1) {
        if($array[0][0] <= $element && $element <= $array[0][1]) {
            return true;
        } else {
            return false;
        }
    } else if(count($array) === 0) {
        return false;
    }        
    // split the array into two halves and a central element. 
    $tC = count($array) >> 1;
    // rightmost left element
    if($array[$tC-1][1] >= $element) {
        return $this->binaryIntervalSearch(array_slice($array, 0, $tC), $element);
    } else if($array[$tC][0] <= $element) {
        return $this->binaryIntervalSearch(array_slice($array, $tC), $element);
    }
    return false;   
}
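The recursive lookup above copies half of the table with array_slice on every step. If that turns out to matter for long log lines, an iterative variant avoids the copies. This is only a sketch: the function name in_intervals is hypothetical, and it assumes the same sorted, non-overlapping [start, end] table as above:

```php
<?php
// Hypothetical iterative variant of the interval lookup.
// $intervals must be sorted and non-overlapping, e.g. [[0x0300, 0x036F], ...].
function in_intervals(array $intervals, int $cp): bool
{
    $lo = 0;
    $hi = count($intervals) - 1;
    while ($lo <= $hi) {
        $mid = ($lo + $hi) >> 1;
        if ($cp < $intervals[$mid][0]) {
            $hi = $mid - 1;      // code point lies left of this interval
        } elseif ($cp > $intervals[$mid][1]) {
            $lo = $mid + 1;      // code point lies right of this interval
        } else {
            return true;         // start <= code point <= end
        }
    }
    return false;
}

// First three entries of the combining table, for illustration.
$table = [[0x0300, 0x036F], [0x0483, 0x0486], [0x0488, 0x0489]];
var_dump(in_intervals($table, 0x0300)); // bool(true)
var_dump(in_intervals($table, 0x0487)); // bool(false)
```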
aphid
  • You're self-answering your edit, not the original question (which deals with East-Asian widths rather than grapheme clusters). – georg Aug 18 '14 at 12:14
  • Yes, it seems this would only be part of the solution. – aphid Aug 18 '14 at 12:51