Multi-byte safe wordwrap() function for UTF-8

Question

PHP's wordwrap() function doesn't work correctly for multi-byte strings like UTF-8.

There are a few examples of mb safe functions in the comments, but with some different test data they all seem to have some problems.

The function should take the exact same parameters as wordwrap().

Specifically, the function should:

cut mid-word if $cut = true, don't cut mid-word otherwise
not insert extra spaces in words if $break = ' '
also work for $break = "\n"
work for ASCII, and all valid UTF-8
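The requirements above can be turned into a rough automated check. This is a minimal sketch, not a complete test suite: `wrap_violations()` is a hypothetical helper name, and it assumes the test text separates words with single spaces and does not already contain `$break`.

```php
<?php
// Hypothetical helper (not from the question): lists which of the
// requirements a candidate wrap function violates for one input.
function wrap_violations(callable $wrap, string $text, int $width, string $break = "\n"): array
{
    $violations = [];

    // With $cut = true, every line must be valid UTF-8 and at most $width characters.
    foreach (explode($break, $wrap($text, $width, $break, true)) as $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $violations['invalid UTF-8'] = true;
        } elseif (mb_strlen($line, 'UTF-8') > $width) {
            $violations['line longer than width'] = true;
        }
    }

    // With $cut = false, every original word must survive unbroken.
    $uncut = $wrap($text, $width, $break, false);
    foreach (explode(' ', $text) as $word) {
        if ($word !== '' && strpos($uncut, $word) === false) {
            $violations['word was split'] = true;
        }
    }

    return array_keys($violations);
}

// The built-in passes on ASCII but cuts a Cyrillic word in the middle of a character.
var_dump(wrap_violations('wordwrap', 'The quick brown fox', 10));
var_dump(wrap_violations('wordwrap', 'проверка', 3));
```

Any of the candidate functions in the answers can be passed as the first argument in place of `'wordwrap'`.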

The two methods [`s($str)->truncate($length, $break)`](https://github.com/delight-im/PHP-Str/blob/8fd0c608d5496d43adaa899642c1cce047e076dc/src/Str.php#L233) and [`s($str)->truncateSafely($length, $break)`](https://github.com/delight-im/PHP-Str/blob/8fd0c608d5496d43adaa899642c1cce047e076dc/src/Str.php#L246) do exactly that, as found in [this standalone library](https://github.com/delight-im/PHP-Str). The first one is for `$cut = true` and the second one for `$cut = false`. They're Unicode-safe. — caw, Jul 27 '16 at 03:37

Fosfor · Answer 1 · 2011-02-14T14:28:15.593

21

I haven't found any code that works for me, so here is what I've written. It works for me, though it is probably not the fastest.

function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false) {
    $lines = explode($break, $str);
    foreach ($lines as &$line) {
        $line = rtrim($line);
        if (mb_strlen($line) <= $width)
            continue;
        $words = explode(' ', $line);
        $line = '';
        $actual = '';
        foreach ($words as $word) {
            if (mb_strlen($actual.$word) <= $width)
                $actual .= $word.' ';
            else {
                if ($actual != '')
                    $line .= rtrim($actual).$break;
                $actual = $word;
                if ($cut) {
                    while (mb_strlen($actual) > $width) {
                        $line .= mb_substr($actual, 0, $width).$break;
                        $actual = mb_substr($actual, $width);
                    }
                }
                $actual .= ' ';
            }
        }
        $line .= trim($actual);
    }
    return implode($break, $lines);
}

edited Feb 14 '11 at 14:28

answered Feb 14 '11 at 03:03

Fosfor


Worked well for me too! – Ben Sinclair Oct 23 '14 at 05:57
I've been using this for a few years, but not heavily. Anyway I included this function in a php class I put as a gist on github under MIT and just need to verify that is okay - https://gist.github.com/AliceWonderMiscreations/7694e8aa644cf1b1fc3910b1c949e092 – Alice Wonder Dec 26 '17 at 19:01
Tried this code with PHP 5.6 and it didn't work for me =( Does it require ini_set and mb_internal_encoding to be set? – bovino Marcelo Bezerra Jan 25 '18 at 13:45
@AliceWonder Didn't find the link any more, but generally no problem:) – Fosfor May 22 '18 at 19:40
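On the encoding question raised in the comments: the function above calls `mb_strlen()` and `mb_substr()` without an encoding argument, so the result depends on `mb_internal_encoding()`, whose default varies with PHP version and configuration. A small sketch of the difference, assuming the source file itself is saved as UTF-8:

```php
<?php
$text = 'é'; // U+00E9, two bytes in UTF-8

// Passing the encoding explicitly does not depend on any ini setting.
$asUtf8   = mb_strlen($text, 'UTF-8');      // counts characters
$asLatin1 = mb_strlen($text, 'ISO-8859-1'); // treats every byte as one character

// Alternatively, set the default once at startup.
mb_internal_encoding('UTF-8');
$withDefault = mb_strlen($text);

var_dump($asUtf8, $asLatin1, $withDefault); // int(1) int(2) int(1)
```

If the internal encoding is a single-byte one, the wrap width ends up being counted in bytes again, which may explain a "didn't work" report.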

Fleshgrinder · Answer 2 · 2013-09-04T18:03:19.330

Because no answer was handling every use case, here is something that does. The code is based on Drupal’s AbstractStringWrapper::wordWrap.

<?php

/**
 * Wraps any string to a given number of characters.
 *
 * This implementation is multi-byte aware and relies on {@link
 * http://www.php.net/manual/en/book.mbstring.php PHP's multibyte
 * string extension}.
 *
 * @see wordwrap()
 * @link https://api.drupal.org/api/drupal/core%21vendor%21zendframework%21zend-stdlib%21Zend%21Stdlib%21StringWrapper%21AbstractStringWrapper.php/function/AbstractStringWrapper%3A%3AwordWrap/8
 * @param string $string
 *   The input string.
 * @param int $width [optional]
 *   The number of characters at which <var>$string</var> will be
 *   wrapped. Defaults to <code>75</code>.
 * @param string $break [optional]
 *   The line is broken using the optional break parameter. Defaults
 *   to <code>"\n"</code>.
 * @param boolean $cut [optional]
 *   If the <var>$cut</var> is set to <code>TRUE</code>, the string is
 *   always wrapped at or before the specified <var>$width</var>. So if
 *   you have a word that is larger than the given <var>$width</var>, it
 *   is broken apart. Defaults to <code>FALSE</code>.
 * @return string
 *   Returns the given <var>$string</var> wrapped at the specified
 *   <var>$width</var>.
 */
function mb_wordwrap($string, $width = 75, $break = "\n", $cut = false) {
  $string = (string) $string;
  if ($string === '') {
    return '';
  }

  $break = (string) $break;
  if ($break === '') {
    trigger_error('Break string cannot be empty', E_USER_ERROR);
  }

  $width = (int) $width;
  if ($width === 0 && $cut) {
    trigger_error('Cannot force cut when width is zero', E_USER_ERROR);
  }

  if (strlen($string) === mb_strlen($string)) {
    return wordwrap($string, $width, $break, $cut);
  }

  $stringWidth = mb_strlen($string);
  $breakWidth = mb_strlen($break);

  $result = '';
  $lastStart = $lastSpace = 0;

  for ($current = 0; $current < $stringWidth; $current++) {
    $char = mb_substr($string, $current, 1);

    $possibleBreak = $char;
    if ($breakWidth !== 1) {
      $possibleBreak = mb_substr($string, $current, $breakWidth);
    }

    if ($possibleBreak === $break) {
      $result .= mb_substr($string, $lastStart, $current - $lastStart + $breakWidth);
      $current += $breakWidth - 1;
      $lastStart = $lastSpace = $current + 1;
      continue;
    }

    if ($char === ' ') {
      if ($current - $lastStart >= $width) {
        $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
        $lastStart = $current + 1;
      }

      $lastSpace = $current;
      continue;
    }

    if ($current - $lastStart >= $width && $cut && $lastStart >= $lastSpace) {
      $result .= mb_substr($string, $lastStart, $current - $lastStart) . $break;
      $lastStart = $lastSpace = $current;
      continue;
    }

    if ($current - $lastStart >= $width && $lastStart < $lastSpace) {
      $result .= mb_substr($string, $lastStart, $lastSpace - $lastStart) . $break;
      $lastStart = $lastSpace = $lastSpace + 1;
      continue;
    }
  }

  if ($lastStart !== $current) {
    $result .= mb_substr($string, $lastStart, $current - $lastStart);
  }

  return $result;
}

?>

Works great for Cyrillic words in UTF-8. – Aleksey Kuznetsov May 01 '19 at 16:21

score 5 · Answer 3 · answered Feb 10 '11 at 19:30


/**
 * wordwrap for utf8 encoded strings
 *
 * @param string $str
 * @param integer $width
 * @param string $break
 * @param boolean $cut
 * @return string
 * @author Milian Wolff <mail@milianw.de>
 */

function utf8_wordwrap($str, $width, $break, $cut = false) {
    if (!$cut) {
        $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.',}\b#U';
    } else {
        $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
    }
    if (function_exists('mb_strlen')) {
        $str_len = mb_strlen($str,'UTF-8');
    } else {
        $str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
    }
    $while_what = ceil($str_len / $width);
    $i = 1;
    $return = '';
    while ($i < $while_what) {
        preg_match($regexp, $str,$matches);
        $string = $matches[0];
        $return .= $string.$break;
        $str = substr($str, strlen($string));
        $i++;
    }
    return $return.$str;
}

Total time: 0.0020880699 is good time :)

answered Feb 10 '11 at 19:30

sacrebleu


If not `$cut`, this function has a flaw. It won't wrap earlier if possible (which is what [`wordwrap`](http://php.net/wordwrap) would do). [See demo](http://codepad.viper-7.com/zxE64Z.php53_t). Not a solution, but a related answer has another [Wordwrap Regex](http://stackoverflow.com/q/2682861/367456#2689242). – hakre Dec 13 '11 at 13:41
This behaves differently from `wordwrap()` where spaces are concerned. – mpyw Aug 10 '13 at 03:11
This one works with cut=true on Simplified Chinese text – exiang Jul 19 '15 at 03:45

This doesn't work for Cyrillic. It breaks words. I didn't look for the reason and am going to try another solution. – Aleksey Kuznetsov May 01 '19 at 16:20
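The mbstring-free fallback in `utf8_wordwrap()` counts characters by matching only the bytes that can start a UTF-8 sequence. A small sketch of that idea on its own, assuming well-formed UTF-8 input (`utf8_char_count()` is a hypothetical name):

```php
<?php
// Count UTF-8 characters by counting lead bytes (ASCII or 0xC0-0xFD).
// Continuation bytes (0x80-0xBF) are never matched, so each character
// contributes exactly one match. No /u modifier: the pattern works on bytes.
function utf8_char_count(string $str): int
{
    return preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str);
}

var_dump(utf8_char_count('проверка')); // int(8)
var_dump(utf8_char_count('日本語'));    // int(3)
```

The count only agrees with `mb_strlen($str, 'UTF-8')` when the input is valid; malformed sequences are not detected.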

score 2 · Answer 4 · answered Sep 12 '18 at 03:21

Just want to share some alternative I found on the net.

<?php
if ( !function_exists('mb_str_split') ) {
    function mb_str_split($string, $split_length = 1)
    {
        mb_internal_encoding('UTF-8'); 
        mb_regex_encoding('UTF-8');  

        $split_length = ($split_length <= 0) ? 1 : $split_length;

        $mb_strlen = mb_strlen($string, 'utf-8');

        $array = array();

        for($i = 0; $i < $mb_strlen; $i += $split_length) {
            $array[] = mb_substr($string, $i, $split_length);
        }

        return $array;
    }
}

Using mb_str_split, you can use join to combine the words with <br>.

<?php
    $text = '<utf-8 content>';

    echo join('<br>', mb_str_split($text, 20));

And finally create your own helper, perhaps mb_textwrap

<?php

if( !function_exists('mb_textwrap') ) {
    function mb_textwrap($text, $length = 20, $concat = '<br>') 
    {
        return join($concat, mb_str_split($text, $length));
    }
}

$text = '<utf-8 content>';
// so simply call
echo mb_textwrap($text);

See screenshot demo:
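As a side note, PHP 7.4 and later ship `mb_str_split()` as part of mbstring, so the polyfill is only needed on older versions. A short sketch, assuming PHP 7.4+; note that this approach cuts at a fixed character count and ignores word boundaries, so it only corresponds to a hard cut:

```php
<?php
// Requires PHP 7.4+, where mb_str_split() is built into mbstring.
$chunks = mb_str_split('проверка проверка', 5, 'UTF-8');

// Each chunk is exactly 5 characters (the last one may be shorter),
// regardless of where the words begin and end:
// ['прове', 'рка п', 'ровер', 'ка']
echo implode("\n", $chunks), "\n";
```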

score 2 · Answer 5 · answered Nov 25 '19 at 19:06

Custom word boundaries

Unicode text has many more potential word boundaries than 8-bit encodings, including 17 space separators, and the full width comma. This solution allows you to customize a list of word boundaries for your application.
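Instead of listing every space by hand, the separators can also be described with a Unicode property. The following is only a sketch under that assumption: `is_word_boundary()` is a hypothetical helper, and it relies on PCRE's `\p{Zs}` class (the Unicode "space separator" category) together with the `u` modifier.

```php
<?php
// Hypothetical helper: treats any Unicode space separator (category Zs),
// a tab, a line feed, or an application-specific extra character as a boundary.
function is_word_boundary(string $char, array $extra = ['，']): bool
{
    return preg_match('/^[\p{Zs}\t\n]\z/u', $char) === 1
        || in_array($char, $extra, true);
}

var_dump(is_word_boundary(' '));         // bool(true)  ASCII space
var_dump(is_word_boundary("\u{3000}"));  // bool(true)  ideographic space
var_dump(is_word_boundary('，'));         // bool(true)  full-width comma, via $extra
var_dump(is_word_boundary('a'));         // bool(false)
```

The full-width comma is punctuation rather than a space separator, which is why it has to be passed explicitly.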

Better performance

Have you ever benchmarked the mb_* family of PHP built-ins? They don't scale well at all. By using a custom nextCharUtf8(), we can do the same job, but orders of magnitude faster, especially on large strings.

<?php

function wordWrapUtf8(
  string $phrase,
  int $width = 75,
  string $break = "\n",
  bool $cut = false,
  array $seps = [' ', "\n", "\t", '，']
): string
{
  $chunks = [];
  $chunk = '';
  $len = 0;
  $pointer = 0;
  while (!is_null($char = nextCharUtf8($phrase, $pointer))) {
    $chunk .= $char;
    $len++;
    if (in_array($char, $seps, true) || ($cut && $len === $width)) {
      $chunks[] = [$len, $chunk];
      $len = 0;
      $chunk = '';
    }
  }
  if ($chunk !== '') { // the string '0' is falsy in PHP, so compare against ''
    $chunks[] = [$len, $chunk];
  }
  $line = '';
  $lines = [];
  $lineLen = 0;
  foreach ($chunks as [$len, $chunk]) {
    if ($lineLen + $len > $width) {
      if ($line !== '') {
        $lines[] = $line;
        $lineLen = 0;
        $line = '';
      }
    }
    $line .= $chunk;
    $lineLen += $len;
  }
  if ($line !== '') {
    $lines[] = $line;
  }
  return implode($break, $lines);
}

function nextCharUtf8(&$string, &$pointer)
{
  // EOF
  if (!isset($string[$pointer])) {
    return null;
  }

  // Get the byte value at the pointer
  $char = ord($string[$pointer]);

  // ASCII
  if ($char < 128) {
    return $string[$pointer++];
  }

  // UTF-8
  if ($char < 224) {
    $bytes = 2;
  } elseif ($char < 240) {
    $bytes = 3;
  } elseif ($char < 248) {
    $bytes = 4;
  } elseif ($char < 252) {
    $bytes = 5;
  } else {
    $bytes = 6;
  }

  // Get full multibyte char
  $str = substr($string, $pointer, $bytes);

  // Increment pointer according to length of char
  $pointer += $bytes;

  // Return mb char
  return $str;
}
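The performance claim is easy to check on your own data. Below is a rough timing sketch, not a rigorous benchmark: both function names are hypothetical, the sample text is arbitrary, and the numbers will vary with PHP version and string size. It compares indexing with `mb_substr()` against a single `preg_split()` pass.

```php
<?php
// Walk a UTF-8 string one character at a time using mb_substr().
function chars_via_mb_substr(string $s): array
{
    $out = [];
    $n = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $n; $i++) {
        $out[] = mb_substr($s, $i, 1, 'UTF-8'); // each call locates offset $i again
    }
    return $out;
}

// Split into characters in one pass with the PCRE u modifier.
function chars_via_preg_split(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

$sample = str_repeat('проверка 日本語 ', 500);
foreach (['chars_via_mb_substr', 'chars_via_preg_split'] as $fn) {
    $start = microtime(true);
    $chars = $fn($sample);
    printf("%s: %d chars in %.4f s\n", $fn, count($chars), microtime(true) - $start);
}
```

Both functions return the same list of characters, so the comparison isolates the cost of the iteration strategy.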

score 1 · Answer 6 · answered Oct 26 '18 at 16:35

function mb_wordwrap($str, $width = 74, $break = "\r\n", $cut = false)
{
    return preg_replace(
        '~(?P<str>.{' . $width . ',}?' . ($cut ? '(?(?!.+\s+)\s*|\s+)' : '\s+') . ')(?=\S+)~mus',
        '$1' . $break,
        $str
    );
}
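What makes a regex approach like this multi-byte safe is the `u` modifier: with it, `.{N}` counts characters, and without it the same pattern counts bytes. A minimal illustration, assuming UTF-8 input:

```php
<?php
$text = 'проверка';

preg_match('/^.{3}/u', $text, $withU);    // three characters
preg_match('/^.{3}/',  $text, $withoutU); // three bytes

var_dump($withU[0]);                                // string(6) "про"
var_dump(mb_check_encoding($withoutU[0], 'UTF-8')); // bool(false): a character was split
```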

score 0 · Answer 7 · answered Mar 19 '13 at 08:49

Here is the multibyte wordwrap function I have coded, taking inspiration from others found on the internet.

function mb_wordwrap($long_str, $width = 75, $break = "\n", $cut = false) {
    $long_str = html_entity_decode($long_str, ENT_COMPAT, 'UTF-8');
    $width -= mb_strlen($break);
    if ($cut) {
        $short_str = mb_substr($long_str, 0, $width);
        $short_str = trim($short_str);
    }
    else {
        $short_str = preg_replace('/^(.{1,'.$width.'})(?:\s.*|$)/', '$1', $long_str);
        if (mb_strlen($short_str) > $width) {
            $short_str = mb_substr($short_str, 0, $width);
        }
    }
    if (mb_strlen($long_str) != mb_strlen($short_str)) {
        $short_str .= $break;
    }
    return $short_str;
}

Don't forget to configure PHP to use UTF-8:

ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

I hope this will help. Guillaume

score 0 · Answer 8 · answered Sep 29 '10 at 19:42


Here's my own attempt at a function that passed a few of my own tests, though I can't promise it's 100% perfect, so please post a better one if you see a problem.

/**
 * Multi-byte safe version of wordwrap()
 * Seems to me like wordwrap() is only broken on UTF-8 strings when $cut = true
 * @return string
 */
function wrap($str, $len = 75, $break = " ", $cut = true) { 
    $len = (int) $len;

    if (empty($str))
        return ""; 

    $pattern = "";

    if ($cut)
        $pattern = '/([^'.preg_quote($break, '/').']{'.$len.'})/u';
    else
        return wordwrap($str, $len, $break);

    return preg_replace($pattern, "\${1}".$break, $str); 
}

answered Sep 29 '10 at 19:42

philfreo


`wordwrap()` wraps only at a space character when `$cut` is `false`. This is why it works for UTF-8 which is designed to be backwards-compatible - characters not defined in ASCII are all encoded with the highest bit set, preventing collision with ASCII chars including the space. – Arc Sep 29 '10 at 20:22
Can you clarify? `wordwrap()` doesn't work for UTF-8, for example. I'm not sure what you mean by "wraps only at a space..." – philfreo Sep 29 '10 at 23:04

test your function on this string: проверка проверка – Yaroslav Nov 15 '13 at 21:41
`wordwrap` wraps based on the number of *bytes*, not the number of *characters*. For those who are too lazy to test, `wordwrap('проверка проверка', 32)` will put each word on a separate line. – toxalot Mar 25 '14 at 15:44
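Both points from the comments can be seen in a few lines. With `$cut = false`, `wordwrap()` never corrupts UTF-8 because it only breaks at ASCII spaces, but it measures the width in bytes and so wraps too early. A short sketch using the test string suggested above:

```php
<?php
$text = 'проверка проверка';

var_dump(strlen($text));             // int(33): bytes
var_dump(mb_strlen($text, 'UTF-8')); // int(17): characters

// 17 characters would fit in a width of 32, but wordwrap() sees 33 bytes and wraps.
$wrapped = wordwrap($text, 32);
var_dump($wrapped === "проверка\nпроверка"); // bool(true)

// The break replaced an ASCII space, so each line is still valid UTF-8.
foreach (explode("\n", $wrapped) as $line) {
    var_dump(mb_check_encoding($line, 'UTF-8')); // bool(true)
}
```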

score 0 · Answer 9 · answered Oct 17 '22 at 17:27

In my case, the input was a Japanese paragraph that needed a line break at around 70 characters, and the solutions given above did not wrap it.

I ended up writing a solution which works for me. I have not tested the snippet for performance.

function trimtonumchars($value, $numchars) {
    $value = trim($value);
    if (strlen($value) < $numchars) {
        return str_pad($value, $numchars);
    } else {
        return substr($value, 0, $numchars);
    }
}

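// Declared at top level: a named function nested inside utf8_wordwrap()
// would be redeclared (a fatal error) on the second call.
function internalsafetrimtonumchars($value, $numchars) {
    while (true) {
        $data = trimtonumchars($value, $numchars);
        // json_encode() fails on malformed UTF-8, so back off one byte at a
        // time until the cut lands on a character boundary.
        json_encode($data);
        if (json_last_error() == 0) {
            return array($data, $numchars);
        }
        $numchars--;
    }
}

function utf8_wordwrap($long_str, $width, $break, $cut) {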

    $tokens = array();
    while(strlen($long_str) > 0) {
        $result = internalsafetrimtonumchars($long_str, $width);
        $token = $result[0];
        $length = $result[1];
        $long_str = substr(trim($long_str), $length);
    
        array_push($tokens, $token);   
    }
    return join($break, $tokens);
}
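The answer above finds a safe cut point by trial and error with `json_encode()`. A different way to get the same effect, shown here only as a sketch, is to inspect the bytes directly: UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so a cut is safe when the next byte is not one of them. `utf8_safe_cut()` is a hypothetical name, and the sketch assumes the input is already valid UTF-8.

```php
<?php
// Cut a UTF-8 string to at most $maxBytes bytes without splitting a character.
function utf8_safe_cut(string $bytes, int $maxBytes): string
{
    if (strlen($bytes) <= $maxBytes) {
        return $bytes;
    }
    $cut = $maxBytes;
    // Move the cut left while it would land in front of a continuation byte.
    while ($cut > 0 && (ord($bytes[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($bytes, 0, $cut);
}

var_dump(utf8_safe_cut('日本語', 4)); // string(3) "日"
var_dump(utf8_safe_cut('日本語', 6)); // string(6) "日本"
```

Like the answer it follows, this measures the limit in bytes, not characters, so a Japanese line cut this way holds roughly a third as many characters as the byte limit.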

score -2 · Accepted Answer · answered Sep 30 '10 at 18:03


This one seems to work well...

function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
    if ($charset === null) $charset = mb_internal_encoding();

    $pieces = explode($break, $str);
    $result = array();
    foreach ($pieces as $piece) {
      $current = $piece;
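      while ($cut && mb_strlen($current, $charset) > $width) {
        $result[] = mb_substr($current, 0, $width, $charset);
        // A null length means "to the end of the string"; a fixed cap of 2048 would drop longer tails.
        $current = mb_substr($current, $width, null, $charset);
      }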
      $result[] = $current;
    }
    return implode($break, $result);
}

answered Sep 30 '10 at 18:03

philfreo


shouldn't $break be rather PHP_EOL? so it'd be cross-platform? – ThatGuy Jul 25 '11 at 00:39

mmm. it also doesn't split long words. – ThatGuy Jul 25 '11 at 00:47
Why do you explode the string using line-breaks? Shouldn't you be using spaces instead (for splitting words)? – Edson Medina Nov 09 '12 at 10:05
You should not use explode either, because with some encodings (like UCS-2) this may break some symbols. – Zebooka Mar 14 '14 at 08:24
If the goal is to add multi-byte support to PHP's standard `wordwrap`, the function should preserve original line breaks regardless of type (`\r`, `\n`, `\r\n`) and regardless of string used for `$break`. – toxalot Mar 25 '14 at 18:27
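One way to meet the requirement in the last comment is to split the input on its existing line terminators, wrap each line on its own, and put the terminators back untouched. This is a sketch of that idea only; `wrap_preserving_newlines()` is a hypothetical name and `$wrapLine` stands for whichever single-line wrap function you choose.

```php
<?php
// Applies $wrapLine to each line separately and restores the original
// line terminators (\r\n, \r or \n) exactly as they were.
function wrap_preserving_newlines(string $text, callable $wrapLine): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, terminator, text, ...
    $parts = preg_split('/(\r\n|\r|\n)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes hold text, odd ones hold terminators
            $parts[$i] = $wrapLine($part);
        }
    }
    return implode('', $parts);
}

// strtoupper stands in for a real wrap function to show the terminators survive.
var_dump(wrap_preserving_newlines("ab cd\r\nef gh\rij", 'strtoupper') === "AB CD\r\nEF GH\rIJ"); // bool(true)
```

In practice the callable would be a closure that fixes the width and break string, for example one that forwards to a multi-byte wrap function from the answers above.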
