Custom word boundaries
Unicode text has many more potential word boundaries than text in an 8-bit encoding, including 17 space separator characters and the fullwidth comma. This solution lets you pass in your own list of word boundaries to suit your application.
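As a rough sketch of what such a custom list could look like, the helper below collects the breakable Unicode space separators plus a couple of CJK commas. The function name `unicodeWordSeparators()` is hypothetical and not part of the original solution, and the choice of characters is an assumption you should adapt to your own text.

```php
<?php
// Hypothetical helper (not from the original answer): a wider separator list.
function unicodeWordSeparators(): array
{
    return [
        "\n", "\t", ',',
        ' ',         // U+0020 SPACE
        "\u{1680}",  // OGHAM SPACE MARK
        "\u{2000}", "\u{2001}", "\u{2002}", "\u{2003}",  // EN QUAD .. EM SPACE
        "\u{2004}", "\u{2005}", "\u{2006}",              // THREE-, FOUR-, SIX-PER-EM SPACE
        "\u{2008}", "\u{2009}", "\u{200A}",              // PUNCTUATION, THIN, HAIR SPACE
        "\u{205F}",  // MEDIUM MATHEMATICAL SPACE
        "\u{3000}",  // IDEOGRAPHIC SPACE
        "\u{FF0C}",  // FULLWIDTH COMMA
        "\u{3001}",  // IDEOGRAPHIC COMMA
        // Assumption: U+00A0, U+2007 and U+202F are left out because they
        // are meant to be non-breaking spaces.
    ];
}

$seps = unicodeWordSeparators();
```

The resulting array can be passed as the fifth argument, for example `wordWrapUtf8($text, 40, "\n", false, unicodeWordSeparators())`.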
Better performance
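Have you ever benchmarked the mb_* family of PHP built-ins? They don't scale well when you walk a string one character at a time: as far as I can tell, each mb_substr() call has to count characters from the start of the string, so the loop as a whole becomes quadratic. By using a custom nextCharUtf8() that advances a byte pointer, we can do the same job in a single pass, which can be orders of magnitude faster, especially on large strings.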
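If you want to check the performance claim on your own setup, a micro-benchmark along these lines may help. It is only a sketch: the function names `countCharsByPointer()` and `countCharsByMbSubstr()` are made up for this example, the pointer walk assumes well-formed UTF-8 of at most 4 bytes per character, and the mbstring extension must be loaded. The timings depend heavily on PHP version and hardware.

```php
<?php
// Walk the string with a byte pointer: one pass, linear in the byte length.
// Assumes well-formed UTF-8 (no validation is done here).
function countCharsByPointer(string $s): int
{
    $count = 0;
    $i = 0;
    $n = strlen($s);
    while ($i < $n) {
        $byte = ord($s[$i]);
        // Sequence length taken from the lead byte.
        $i += $byte < 0x80 ? 1 : ($byte < 0xE0 ? 2 : ($byte < 0xF0 ? 3 : 4));
        $count++;
    }
    return $count;
}

// Same walk, but fetching each character with mb_substr().
function countCharsByMbSubstr(string $s): int
{
    $count = 0;
    $n = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $n; $i++) {
        $char = mb_substr($s, $i, 1, 'UTF-8');
        if ($char !== '') {
            $count++;
        }
    }
    return $count;
}

// 12 characters per repetition, mixing 1-, 2- and 3-byte sequences.
$text = str_repeat("h\u{E9}llo w\u{F6}rld\u{3001}", 500);

$t0 = hrtime(true);
$a = countCharsByPointer($text);
$t1 = hrtime(true);
$b = countCharsByMbSubstr($text);
$t2 = hrtime(true);

printf(
    "pointer: %.2f ms, mb_substr: %.2f ms (%d chars)\n",
    ($t1 - $t0) / 1e6,
    ($t2 - $t1) / 1e6,
    $a
);
```

Increasing the repetition count should make any difference in scaling easier to see than a single run on a short string.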
<?php
function wordWrapUtf8(
    string $phrase,
    int $width = 75,
    string $break = "\n",
    bool $cut = false,
    array $seps = [' ', "\n", "\t", ',']
): string
{
    // Pass 1: split the phrase into chunks, each ending at a separator
    // (or after $width characters when $cut is enabled).
    $chunks = [];
    $chunk = '';
    $len = 0;
    $pointer = 0;
    while (!is_null($char = nextCharUtf8($phrase, $pointer))) {
        $chunk .= $char;
        $len++;
        if (in_array($char, $seps, true) || ($cut && $len === $width)) {
            $chunks[] = [$len, $chunk];
            $len = 0;
            $chunk = '';
        }
    }
    // Compare against '' explicitly: a bare if ($chunk) would drop a trailing "0".
    if ($chunk !== '') {
        $chunks[] = [$len, $chunk];
    }
    // Pass 2: pack the chunks into lines of at most $width characters.
    $line = '';
    $lines = [];
    $lineLen = 0;
    foreach ($chunks as [$len, $chunk]) {
        // Start a new line when the next chunk would overflow the current one.
        if ($lineLen + $len > $width && $line !== '') {
            $lines[] = $line;
            $lineLen = 0;
            $line = '';
        }
        $line .= $chunk;
        $lineLen += $len;
    }
    if ($line !== '') {
        $lines[] = $line;
    }
    return implode($break, $lines);
}
function nextCharUtf8(string &$string, int &$pointer): ?string
{
    // EOF
    if (!isset($string[$pointer])) {
        return null;
    }
    // Get the byte value at the pointer
    $char = ord($string[$pointer]);
    // ASCII
    if ($char < 128) {
        return $string[$pointer++];
    }
    // UTF-8: the lead byte determines the length of the sequence
    if ($char < 224) {
        $bytes = 2;
    } elseif ($char < 240) {
        $bytes = 3;
    } elseif ($char < 248) {
        $bytes = 4;
    } elseif ($char < 252) {
        // Legacy 5-byte form (not valid in modern UTF-8)
        $bytes = 5;
    } else {
        // Legacy 6-byte form (not valid in modern UTF-8)
        $bytes = 6;
    }
    // Get full multibyte char
    $str = substr($string, $pointer, $bytes);
    // Increment pointer according to length of char
    $pointer += $bytes;
    // Return mb char
    return $str;
}
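The thresholds in nextCharUtf8() come from the bit pattern of the UTF-8 lead byte. The sketch below spells that mapping out as a stricter variant, limited to the 1 to 4 byte forms that modern UTF-8 (RFC 3629) allows. The name `utf8SequenceLength()` is hypothetical and the function is not part of the solution above. It returns 0 for bytes that cannot start a sequence, which is one possible way to detect malformed input.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 sequence that starts with
// $byte, or 0 if $byte cannot start a valid sequence.
function utf8SequenceLength(int $byte): int
{
    if ($byte < 0x80) {
        return 1; // 0xxxxxxx: ASCII
    }
    if ($byte < 0xC2) {
        return 0; // 10xxxxxx continuation bytes, and overlong leads 0xC0/0xC1
    }
    if ($byte < 0xE0) {
        return 2; // 110xxxxx
    }
    if ($byte < 0xF0) {
        return 3; // 1110xxxx
    }
    if ($byte < 0xF5) {
        return 4; // 11110xxx, up to U+10FFFF
    }
    return 0;     // 0xF5..0xFF never appear in valid UTF-8
}

$comma = "\u{FF0C}";                             // FULLWIDTH COMMA
$commaLength = utf8SequenceLength(ord($comma[0])); // lead byte 0xEF
```

nextCharUtf8() itself does no validation, so it treats a stray continuation byte as the start of a 2-byte character. Whether that matters depends on how much you trust your input.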