Word wrap text and ignore ANSI escape codes when counting line length

Question

I'm building a CLI app in PHP that has a method to output text:

$out->line('Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.');

I'm limiting the line output to 80 characters within line() via:

public function line(string $text): void
{
  $this->rawLine(wordwrap($text, 80, PHP_EOL));
}

This prints the output across multiple lines:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia
bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id
elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus
porttitor.

Now, I can also style parts of the text using ANSI escape codes:

$out->line('Morbi leo risus, ' . Style::inline('porta ac consectetur', ['color' => 'blue', 'attribute' => 'bold']) . ' ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.');

Which gets converted to this:

Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at
eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh
ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur.
Curabitur blandit tempus porttitor.

And when passed to line(), printed out like this:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros.
Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies
vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur
blandit tempus porttitor.

Here "porta ac consectetur" is blue and bold, but notice that the first line is shorter than before and no longer breaks at the same place.

Even though the escape codes are non-printing characters, wordwrap() (and strlen()) still counts them toward the line length.

The first line is originally 76 characters without ANSI escape codes:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia

But after adding styles, it comes back as 97 characters:

Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia
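The mismatch can be reproduced with a short sketch. It assumes the styles are plain SGR sequences (ESC, `[`, parameters, `m`), as in the sample above; the variable names are only illustrative. Note that with real ESC bytes the two sequences add 15 bytes (7 + 8), so strlen() reports 91; the figure of 97 is what you get when each `\x1b` is counted as four literal characters.

```php
<?php
// First line of the sample, without and with styling.
$plain  = 'Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia';
$styled = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia";

// Strip SGR sequences before measuring. In a single-quoted PHP string,
// \x1b is passed to PCRE unchanged, where it means the ESC byte.
$visible = preg_replace('/\x1b\[[0-9;]*m/', '', $styled);

echo strlen($plain), PHP_EOL;   // 76
echo strlen($styled), PHP_EOL;  // 91 (76 visible + 15 bytes of escape codes)
echo strlen($visible), PHP_EOL; // 76
```

The same over-count is what makes wordwrap() break after "eros.": the escape bytes use up part of the 80-column budget.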

In other parts of the app, like a table, I "solved" this by having a method to set the column value and then a separate method to style said column. That way, I can reliably get the length, but also output the text in the defined style.

I could pass both an unstyled version and a styled version of the text, but that doesn't feel right. Nor does it solve the problem of then splitting the styled version accurately.

To solve the issue with line(), I thought about stripping out the ANSI escape codes to get the actual length, adding the PHP_EOL breaks where needed, and then injecting the styles back in. But that doesn't feel like the right solution, and it seems complicated. How would I even go about doing that?

So my question is: How can I reliably split text containing ANSI escape codes based on text length?

ASCII itself is 7 bit, but can be extended in 8 bit, the escaping is done with escape codes, you will have to account for the escape codes and if 0x1b is found do something special with the counting — sleepyhead, Mar 18 '23 at 20:02
@Nig How close is this to what you need? https://3v4l.org/Blm8b If this is it, I can write up a complete answer. If it's not right, please clarify what I have wrong. — mickmackusa, Mar 23 '23 at 06:18
@mickmackusa This is exactly what I'm looking for. I'd love to see your answer and understand how it works. I don't know how long this took you, but it's very appreciated. — NightHawk, Mar 23 '23 at 15:19

score 0 · Answer 1 · answered Mar 24 '23 at 00:48

Building on an approach I used to truncate text in another answer (Truncate a multibyte String to n chars), the wrapping just needs to ignore the ANSI sequences while counting characters.

To have clean breaks in the text, the snippet below only replaces spaces with newlines (it is not designed to break on hyphens).

Code: (Demo) (Regex101 Demo)

function ansiSafeWrapper(string $string, int $max = 80) {
    return preg_replace(
        "~(?=(?:(?:\\\\x1b\[[0-9;]+m)?.){{$max}})(?:(?:\\\\x1b\[[0-9;]+m)?.){0,$max}\K ~u",
        PHP_EOL,
        str_replace(PHP_EOL, ' ', $string)
    );
}

$test = <<<'ANSI'
Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at
eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh
ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur.
Curabitur blandit tempus porttitor.
ANSI;

echo ansiSafeWrapper($test);

Effectively, the script replaces all newlines with spaces, then injects new newlines where appropriate, returning the following (I've added the character counts at the end of each line for clarity):

Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia  (97 char)
bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id  (80 char)
elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus  (77 char)
porttitor. (10 char)

Which will be visually presented without ANSI sequences as:

Morbi leo risus, porta ac consectetur ac, vestibulum at eros. Aenean lacinia  (76 char)
bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id  (80 char)
elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus  (77 char)
porttitor. (10 char)

Pattern Breakdown:

~                                   #starting pattern delimiter
(?=                                 #start of lookahead
   (?:(?:\\\\x1b\[[0-9;]+m)?.){80}  #consume potential whole ansi code before each single character; match 80 (non-ansi) characters
)                                   #end of lookahead
(?:(?:\\\\x1b\[[0-9;]+m)?.){0,80}   #consume potential whole ansi code before each single character; match up to 80 (non-ansi) characters
\K                                  #forget any characters matched up to this point, then match a literal space
~                                   #ending pattern delimiter
u                                   #unicode pattern flag for multibyte safety
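The pattern above matches the four literal characters `\x1b` as they appear in the nowdoc sample. A minimal sketch of the same idea for strings that hold the real ESC byte might look like the following. The name `ansiSafeWrapBytes` is hypothetical, only SGR sequences (ESC `[` ... `m`) are recognised, and every word is assumed to be shorter than `$max`. It also deviates from the pattern above in three labelled ways: the counted character excludes ESC, trailing codes are allowed before the breaking space, and the lookahead requires `$max + 1` visible characters.

```php
<?php
// Sketch only: a variant of the lookahead approach for real ESC bytes.
function ansiSafeWrapBytes(string $string, int $max = 80): string
{
    // Zero or more SGR sequences; \x1b is interpreted by PCRE as the ESC byte.
    $ansi = '(?:\x1b\[[0-9;]*m)*';
    // One visible character, optionally preceded by escape sequences.
    $unit = $ansi . '[^\x1b]';

    $pattern = sprintf(
        '~(?=(?:%1$s){%2$d})(?:%1$s){0,%3$d}%4$s\K ~u',
        $unit,
        $max + 1, // assumption: only wrap when more than $max visible characters remain
        $max,
        $ansi     // assumption: allow codes such as a reset right before the space
    );

    // Fall back to the input if PCRE reports an error (for example invalid UTF-8).
    return preg_replace($pattern, PHP_EOL, str_replace(PHP_EOL, ' ', $string)) ?? $string;
}

$styled = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. "
    . "Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id "
    . "elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.";

echo ansiSafeWrapBytes($styled), PHP_EOL;
```

With the sample text this should break after "lacinia", "ut id" and "tempus", the same places as the unstyled text.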

I've had a chance to test it, but it doesn't quite work for me. If I use the code exactly as written, it outputs just as you describe (but it doesn't actually process the escape codes-- it prints them). If I change `'ANSI'` to `"ANSI"`, the text has color, but then it's back to ending with `eros.`. I'm going to play with it to see if I can figure it out. — NightHawk, Mar 24 '23 at 21:08
Given a correct/accurate/realistic sample input, I am confident that I can fix my pattern. Let me know what you find out. — mickmackusa, Mar 25 '23 at 03:12
I appreciate you following up. It's been a while since I've had a chance to work on this, but I couldn't make sense of the lookahead and the syntax, so I didn't get the regex working. That said, I came up with something else. It's not as elegant as yours, but I'll post this as an answer, and then you can compare our results. — NightHawk, Apr 12 '23 at 21:31
So long as your answer includes clear input and output, that should be enough for me to rejig my answer. — mickmackusa, Apr 12 '23 at 21:43

NightHawk · Accepted Answer · 2023-04-13T03:32:34.880

This is the input:

$styledText = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac, vestibulum at eros. Aenean lacinia bibendum nulla sed consectetur. Nullam id dolor id nibh ultricies vehicula ut id elit. Aenean lacinia bibendum nulla sed consectetur. Curabitur blandit tempus porttitor.";

The following method strips out escape codes from styled text and saves a copy as clean text.

The clean text is used to add line breaks using wordwrap based on desired column width.

It then loops over the styled text and injects a line break after every word that wordwrap() followed with a line break in the clean text.

function wrap(string $styledText) {

  // Strip ANSI escape codes from $styledText
  $cleanText = preg_replace('/\\x1b\[[0-9;]+m/', '', $styledText);

  // Add PHP_EOL to ensure $cleanText does not exceed line width
  $cleanWrappedText = wordwrap($cleanText, 80, PHP_EOL . ' ');

  // Split $styledText and $cleanWrappedText on each space
  $styledTextArray = explode(' ', $styledText);
  $cleanTextArray = explode(' ', $cleanWrappedText);

  // $fusedText will comprise $styledText w/ line breaks from $cleanWrappedText
  $fusedText = '';

  // Loop over each segment (likely a word)
  foreach ($styledTextArray as $index => $segment) {

    // Append word (with ANSI escape codes)
    $fusedText .= $segment;

    // If word has line break in clean version then add line break
    if (str_ends_with($cleanTextArray[$index], PHP_EOL)) {
      $fusedText .= PHP_EOL;
      continue;
    }

    // If word does not have line break in clean version,
    // but there is another word coming, then add space between words
    if (isset($cleanTextArray[$index + 1])) {
      $fusedText .= ' ';
    }
  }

  return $fusedText;
}

Note that this can't easily be tested on the web, since the escape codes only style text appropriately when used via a CLI.
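The wrapping itself can still be checked without a terminal by measuring each line after stripping the codes. The helpers below are hypothetical (they are not part of either answer) and assume SGR-style sequences and single-byte text, since strlen() counts bytes.

```php
<?php
// Hypothetical helper: visible length of every line in already-wrapped text.
function visibleLineLengths(string $wrapped): array
{
    $lengths = [];
    foreach (explode(PHP_EOL, $wrapped) as $line) {
        // Drop SGR sequences so only printable characters are counted.
        $lengths[] = strlen(preg_replace('/\x1b\[[0-9;]*m/', '', $line));
    }
    return $lengths;
}

// Hypothetical helper: make ESC bytes readable when output is viewed in a browser.
function showEscapes(string $text): string
{
    return str_replace("\x1b", '\x1b', $text);
}

$wrapped = "Morbi leo risus, \x1b[34;1mporta ac consectetur\x1b[39;22m ac," . PHP_EOL
    . "vestibulum at eros.";

print_r(visibleLineLengths($wrapped)); // lengths 41 and 19
echo showEscapes($wrapped), PHP_EOL;
```

Running the output of wrap() through visibleLineLengths() and checking that no value exceeds 80 gives a quick test that does not depend on how the codes are rendered.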

My answer still seems to hold up with your sample string. https://3v4l.org/a7cRG — mickmackusa, Apr 13 '23 at 07:32
Here's a screenshot from terminal comparing what happens if I run the text through my function (top) vs. your function (bottom): https://imgur.com/a/bQJkyKh And if I take your code exactly as it is written, the codes are not interpreted, they are printed: https://imgur.com/a/lDYo6jN — NightHawk, Apr 13 '23 at 14:04
