How to parse cursor ANSI escape codes?

Question

I'm writing code that processes ANSI cursor escape codes for jQuery Terminal, but I'm having problems: I'm not sure how it should work, and I'm getting weird results.

I'm testing with the ervy library, using this code:

function scatter_plot() {
     const scatterData = [];

     for (let i = 1; i < 17; i++) {
         i < 6 ? scatterData.push({ key: 'A', value: [i, i], style: ervy.fg('red', '*') })
           : scatterData.push({ key: 'A', value: [i, 6], style: ervy.fg('red', '*') });
     }

     scatterData.push({ key: 'B', value: [2, 6], style: ervy.fg('blue', '# '), side: 2 });
     scatterData.push({ key: 'C', value: [0, 0], style: ervy.bg('cyan', 2) });

     var plot = ervy.scatter(scatterData, { legendGap: 18, width: 15 });
     // same as Linux XTERM where 0 code is interpreted as 1.
     var formatting = $.terminal.from_ansi(plot.replace(/\x1b\[0([A-D])/g, '\x1b[1$1'));
     return formatting;
}

$.terminal.defaults.formatters = [];
var term = $('body').terminal();
term.echo(scatter_plot());

It should look the same as in Linux XTerm, but it does not (see the CodePen demo).

While writing the question I changed a few of the +1 and -1 offsets applied when moving the cursor (see the handling of the A-F ANSI escapes in the code), and the result changed (the code snippet below has the latest code).

Now the first line gets overwritten by spaces, and the whole plot is shifted one row up and one column to the right. The exception is the cyan dot at 0,0, which should be below " |" and 2 characters wide, so only its right half should be visible; that one is correct, but the rest is not.

This is my new code for processing the cursor. I run it just before processing colors, so the code is not that complex.

// -------------------------------------------------------------------------------
var ansi_re = /(\x1B\[[0-9;]*[A-Za-z])/g;
var cursor_re = /(.*)\r?\n\x1b\[1A\x1b\[([0-9]+)C/;
var move_cursor_split = /(\x1b\[[0-9]+[A-G])/g;
var move_cursor_match = /^\x1b\[([0-9]+)([A-G])/;
// -------------------------------------------------------------------------------
function parse_ansi_cursor(input) {
    /*
        (function(log) {
            console.log = function(...args) {
                if (true || cursor.y === 11) {
                    return log.apply(console, args);
                }
            };
        })(console.log);
        */
    function length(text) {
        return text.replace(ansi_re, '').length;
    }
    function get_index(text, x) {
        var splitted = text.split(ansi_re);
        var format = 0;
        var count = 0;
        var prev_count = 0;
        for (var i = 0; i < splitted.length; i++) {
            var string = splitted[i];
            if (string) {
                if (string.match(ansi_re)) {
                    format += string.length;
                } else {
                    count += string.length;
                    if (count >= x) {
                        var rest = x - prev_count;
                        return format + rest;
                    }
                    prev_count = count;
                }
            }
        }
        return i;
    }
    // ANSI-aware substring: it just re-adds the ANSI escapes that were removed
    // from the beginning; we don't care if they were disabled with 0m
    function substring(text, start, end) {
        var result = text.substring(start, end);
        if (start === 0 || !text.match(ansi_re)) {
            return result;
        }
        var before = text.substring(0, start);
        var match = before.match(ansi_re);
        if (match) {
            return before.match(ansi_re).join('') + result;
        }
        return result;
    }
    // insert text at cursor position
    // result is an array of arrays of substrings; each inner array forms a single line
    function insert(text) {
        if (!text) {
            return;
        }
        if (!result[cursor.y]) {
            result[cursor.y] = [];
        }
        var index = 0;
        var sum = 0;
        var len, after;
        function inject() {
            index++;
            if (result[cursor.y][index]) {
                result[cursor.y].splice(index, 0, null);
            }
        }
        if (cursor.y === 11) {
            //debugger;
        }
        if (text == "[46m  [0m") {
            //debugger;
        }
        console.log({...cursor, text});
        if (cursor.x === 0 && result[cursor.y][index]) {
            source = result[cursor.y][0];
            len = length(text);
            var i = get_index(source, len);
            if (length(source) < len) {
                after = result[cursor.y][index + 1];
                if (after) {
                    i = get_index(after, len - length(source));
                    after = substring(after, i);
                    result[cursor.y].splice(index, 2, null, after);
                } else {
                    result[cursor.y].splice(index, 1, null);
                }
            } else {
                after = substring(source, i);
                result[cursor.y].splice(index, 1, null, after);
            }
        } else {
            var limit = 100000; // infinite loop guard
            var prev_sum = 0;
            // find in which substring to insert the text
            while (index < cursor.x) {
                if (!limit--) {
                    warn('[WARN] Too many loops');
                    break;
                }
                var source = result[cursor.y][index];
                if (!source) {
                    result[cursor.y].push(new Array(cursor.x - prev_sum).join(' '));
                    index++;
                    break;
                }
                if (sum === cursor.x) {
                    inject();
                    break;
                }
                len = length(source);
                prev_sum = sum;
                sum += len;
                if (sum === cursor.x) {
                    inject();
                    break;
                }
                if (sum > cursor.x) {
                    var pivot = get_index(source, cursor.x - prev_sum);
                    var before = substring(source, 0, pivot);
                    var end = get_index(source, length(text));
                    after = substring(source, pivot + end);
                    if (!after.length) {
                        result[cursor.y].splice(index, 1, before);
                    } else {
                        result[cursor.y].splice(index, 1, before, null, after);
                    }
                    index++;
                    break;
                } else {
                    index++;
                }
            }
        }
        cursor.x += length(text);
        result[cursor.y][index] = text;
    }
    if (input.match(move_cursor_split)) {
        var lines = input.split('\n').filter(Boolean);
        var cursor = {x: 0, y: -1};
        var result = [];
        for (var i = 0; i < lines.length; ++i) {
            console.log('-------------------------------------------------');
            var string = lines[i];
            cursor.x = 0;
            cursor.y++;
            var splitted = string.split(move_cursor_split).filter(Boolean);
            for (var j = 0; j < splitted.length; ++j) {
                var part = splitted[j];
                console.log(part);
                var match = part.match(move_cursor_match);
                if (match) {
                    var ansi_code = match[2];
                    var value = +match[1];
                    console.log({code: ansi_code, value, ...cursor});
                    if (value === 0) {
                        continue;
                    }
                    switch (ansi_code) {
                        case 'A': // UP
                            cursor.y -= value;
                            break;
                        case 'B': // Down
                            cursor.y += value - 1;
                            break;
                        case 'C': // forward
                            cursor.x += value + 1;
                            break;
                        case 'D': // Back
                            cursor.x -= value + 1;
                            break;
                        case 'E': // Cursor Next Line
                            cursor.x = 0;
                            cursor.y += value - 1;
                            break;
                        case 'F': // Cursor Previous Line
                            cursor.x = 0;
                            cursor.y -= value + 1;
                            break;
                    }
                    if (cursor.x < 0) {
                        cursor.x = 0;
                    }
                    if (cursor.y < 0) {
                        cursor.y = 0;
                    }
                } else {
                    insert(part);
                }
            }
        }
        return result.map(function(line) {
            return line.join('');
        }).join('\n');
    }
    return input;
}

The result = []; in the code is an array of lines, where a single line may be split into multiple substrings when text is inserted at the cursor. Maybe the code would be simpler if it were an array of strings. Right now I only want to fix the cursor position.

Here is the CodePen demo with the from_ansi function embedded (it contains parse_ansi_cursor, which is the problematic part). Sorry, there is a lot of code, but parsing ANSI escape codes is not simple.

What I'm not sure about is how moving the cursor should work (right now it has + 1 or - 1 offsets, and I'm not sure about them). I'm also not sure whether I should increase cursor.y before each line. I've looked into the Linux XTerm code but didn't find any clues. I also looked at Xterm.js, but it renders the ervy scatter plot completely broken.

My from_ansi function originally had code that processed some ANSI cursor codes, like this one:

        input = input.replace(/\x1b\[([0-9]+)C/g, function(_, num) {
            return new Array(+num + 1).join(' ');
        });

It handled only C (cursor forward) by adding blanks. That worked for ANSI art, but it does not work with the ervy scatter plot.

I don't think it's too broad: it's just a question about moving the cursor and processing newlines using ANSI escape codes. It's also supposed to be a simple case: the cursor should move only inside a single string, not outside it as in a real terminal (ervy plots output ANSI escape codes like that).

I'm fine with answers that explain how to process the string and how to move the cursor in a way that works, but it would be great if you could provide fixes to the code. I prefer fixes to my code, not a whole new implementation, unless it is much simpler, is a function parse_ansi_cursor(input), and works the same with the rest of the code but with fixed cursor movement.

EDIT: I've found that my input.split('\n').filter(Boolean) was wrong. It should be:

            var lines = input.split('\n');
            if (input.match(/^\n/)) {
                lines.shift();
            }
            if (input.match(/\n$/)) {
                lines.pop();
            }

It also seems that an old spec for ANSI escapes says that 0 is not zero but a placeholder for the default, which is 1. That was removed from the spec, but XTerm still uses it. So I've added this line to the parsing code, so that 0A or A gets the value 1:

var value = match[1].match(/^0?$/) ? 1 : +match[1];

The plot looks better, but there are still issues with the cursor (I think it's the cursor, but I'm not 100% sure).

I've changed the +1/-1 offsets again and now it's closer (almost the same as in XTerm), but there must still be a bug in my code.

EDIT:

After the answer by @jerch I tried to use the node ansi parser, but I have the same issue: I don't know how to process the cursor:

var cursor = {x:0,y:0};
result = [];
var terminal = {
    inst_p: function(s) {
        var line = result[cursor.y];
        if (!line) {
            result[cursor.y] = s;
        } else if (cursor.x === 0) {
            result[cursor.y] = s + line.substring(s.length);
        } else if (line.length < cursor.x) {
            var len = cursor.x - (line.length - 1);
            result[cursor.y] += new Array(len).join(' ') + s;
        } else if (line.length === cursor.x) {
            result[cursor.y] += s;
        } else {
            var before = line.substring(0, cursor.x);
            var after = line.substring(cursor.x + s.length);
            result[cursor.y] = before + s + after;
        }
        cursor.x += s.length;
        console.log({s, ...cursor, line: result[cursor.y]});
    },
    inst_o: function(s) {console.log('osc', s);},
    inst_x: function(flag) {
        var code = flag.charCodeAt(0);
        if (code === 10) {
            cursor.y++;
            cursor.x = 0;
        }
    },
    inst_c: function(collected, params, flag) {
        console.log({collected, params, flag});
        var value = params[0] === 0 ? 1 : params[0];
        switch(flag) {
            case 'A': // UP
                cursor.y -= value;
                break;
            case 'B': // Down
                cursor.y += value - 1;
                break;
            case 'C': // forward
                cursor.x += value;
                break;
            case 'D': // Back
                cursor.x -= value;
                break;
            case 'E': // Cursor Next Line
                cursor.x = 0;
                cursor.y += value;
                break;
            case 'F': // Cursor Previous Line
                cursor.x = 0;
                cursor.y -= value;
                break;
        }
    },
    inst_e: function(collected, flag) {console.log('esc', collected, flag);},
    inst_H: function(collected, params, flag) {console.log('dcs-Hook', collected, params, flag);},
    inst_P: function(dcs) {console.log('dcs-Put', dcs);},
    inst_U: function() {console.log('dcs-Unhook');}
};
var parser = new AnsiParser(terminal);
parser.parse(input);
return result.join('\n');

This is just a simple example that ignores everything except newlines and cursor movement.

Here is the output:

UPDATE:

It seems that every cursor movement should be just += value or -= value, and my value - 1 was just compensating for a bug in the ervy library, which was not working on a clear terminal.
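A minimal sketch of that rule, assuming the input has already been split into tokens. The function name render_cursor_moves and the token shapes are hypothetical and not part of jQuery Terminal or ervy. It ignores colors and wide characters, and it treats a newline as "next row, column 0", as the code above does:

```javascript
// Hypothetical helper: applies cursor moves and text to an array of lines.
// Token shapes (assumed for this sketch):
//   {type: 'print', text: 'abc'}
//   {type: 'newline'}
//   {type: 'move', code: 'A'..'F', value: number}
function render_cursor_moves(tokens) {
    var lines = [];
    var cursor = {x: 0, y: 0};
    function print(text) {
        var line = lines[cursor.y] || '';
        // pad with spaces when the cursor is to the right of the line end
        while (line.length < cursor.x) {
            line += ' ';
        }
        // overwrite (not insert) the characters under the cursor
        lines[cursor.y] = line.substring(0, cursor.x) + text +
            line.substring(cursor.x + text.length);
        cursor.x += text.length;
    }
    tokens.forEach(function(token) {
        if (token.type === 'print') {
            print(token.text);
        } else if (token.type === 'newline') {
            cursor.y += 1;
            cursor.x = 0;
        } else if (token.type === 'move') {
            // a missing or zero parameter is treated as 1
            var n = token.value || 1;
            switch (token.code) {
                case 'A': cursor.y -= n; break;               // up
                case 'B': cursor.y += n; break;               // down
                case 'C': cursor.x += n; break;               // forward
                case 'D': cursor.x -= n; break;               // back
                case 'E': cursor.y += n; cursor.x = 0; break; // next line
                case 'F': cursor.y -= n; cursor.x = 0; break; // previous line
            }
            // the cursor never leaves the top-left corner
            cursor.x = Math.max(cursor.x, 0);
            cursor.y = Math.max(cursor.y, 0);
        }
    });
    // rows that were skipped over become empty lines
    return Array.from(lines, function(line) {
        return line || '';
    }).join('\n');
}

console.log(render_cursor_moves([
    {type: 'print', text: 'hello'},
    {type: 'move', code: 'D', value: 3},
    {type: 'print', text: 'XY'}
])); // heXYo
```

There are no +1 or -1 corrections anywhere: each move changes the cursor by exactly its parameter, and printing overwrites what is already under the cursor.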

jerch · Accepted Answer · 2019-12-29T16:27:49.973

To begin with, a regexp-based approach is not ideal for handling escape sequences. The reason is the complicated interactions between various terminal sequences: some break a former, not yet closed one, while others keep working in the middle of another (like some control codes), and the "outer" sequence still finishes correctly. You would have to pull all these edge cases into every single regexp (see https://github.com/xtermjs/xterm.js/issues/2607#issuecomment-562648768 for an illustration).

In general, parsing escape sequences is quite tricky; we even have an issue regarding that in terminal-wg. Hopefully we manage to get some minimal parsing requirements from this in the future. Most certainly it will not be regexp-based ;)

All that said, it's much easier to go with a real parser that deals with all the edge cases. A good starting point for a DEC compatible parser is https://vt100.net/emu/dec_ansi_parser. For cursor handling you have to handle at least these states with all actions:
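As a small illustration of the control-code case: the regexp below is the ansi_re from the question without its capture group, and strip_ansi is a hypothetical helper name. In the DEC parser model a newline inside a CSI sequence is executed and the sequence still completes, but the regexp no longer recognizes the sequence at all:

```javascript
// Regexp taken from the question (capture group dropped).
var ansi_re = /\x1B\[[0-9;]*[A-Za-z]/g;

// Hypothetical helper: removes everything the regexp recognizes.
function strip_ansi(text) {
    return text.replace(ansi_re, '');
}

// A well-formed "cursor up 1" is found and removed.
console.log(strip_ansi('ab\x1b[1Acd') === 'abcd'); // true

// The same sequence with a newline between the parameter and the final
// byte is not matched, so the string comes back unchanged.
var interrupted = 'ab\x1b[1\nAcd';
console.log(strip_ansi(interrupted) === interrupted); // true
```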

ground
escape
csi_entry
csi_ignore
csi_param
csi_intermediate

plus all other states as dummy entries. Control codes also need special care (action execute), as they might interfere at any time with any other sequence, with different results.

To make things even worse, the official ECMA-48 specification differs slightly from the DEC parser in certain aspects. Still, most emulators in use these days aim for DEC VT100+ compatibility.

If you don't want to write the parser yourself, you can either use/modify my old parser or the one we have in xterm.js (the latter might be harder to integrate as it operates on UTF32 codepoints).
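The states listed above could be sketched roughly as follows. This is a heavily simplified reading of the vt100.net diagram, not the code of either parser mentioned in this answer. The name tokenize_ansi and the token shapes are made up for illustration. String sequences (OSC, DCS and friends) are not handled, and characters above 0x7E inside a sequence are simply skipped:

```javascript
// Hypothetical tokenizer: splits input into print, execute and csi tokens.
function tokenize_ansi(input) {
    var tokens = [];
    var state = 'ground';
    var text = '', params = '', collected = '';
    function flush() {
        if (text) {
            tokens.push({type: 'print', text: text});
            text = '';
        }
    }
    function dispatch(final) {
        tokens.push({
            type: 'csi',
            collected: collected,
            // an empty parameter becomes 0, which means "default"
            params: params.split(';').map(Number),
            final: final
        });
        state = 'ground';
    }
    for (var i = 0; i < input.length; i++) {
        var ch = input.charAt(i);
        var code = input.charCodeAt(i);
        if (code === 0x1b) {
            // ESC aborts any unfinished sequence and starts a new one
            flush();
            state = 'escape';
            params = '';
            collected = '';
            continue;
        }
        if (code === 0x18 || code === 0x1a) {
            // CAN and SUB cancel the current sequence
            flush();
            state = 'ground';
            continue;
        }
        if (code < 0x20) {
            // other control codes are executed in any state,
            // and an open sequence keeps going afterwards
            flush();
            tokens.push({type: 'execute', code: code});
            continue;
        }
        if (state === 'ground') {
            text += ch;
            continue;
        }
        if (code > 0x7e) {
            continue; // simplification: skipped inside sequences
        }
        switch (state) {
            case 'escape':
                if (ch === '[') {
                    state = 'csi_entry';
                } else if (code >= 0x30) {
                    state = 'ground'; // dummy: other escapes are dropped
                }
                break;
            case 'csi_entry':
            case 'csi_param':
                if (code >= 0x40) {
                    dispatch(ch);
                } else if (code <= 0x2f) {
                    collected += ch;
                    state = 'csi_intermediate';
                } else if (code <= 0x39 || ch === ';') {
                    params += ch;
                    state = 'csi_param';
                } else if (state === 'csi_entry' && code >= 0x3c) {
                    collected += ch; // private marker such as '?'
                    state = 'csi_param';
                } else {
                    state = 'csi_ignore'; // ':' or a misplaced marker
                }
                break;
            case 'csi_intermediate':
                if (code >= 0x40) {
                    dispatch(ch);
                } else if (code <= 0x2f) {
                    collected += ch;
                } else {
                    state = 'csi_ignore';
                }
                break;
            case 'csi_ignore':
                if (code >= 0x40) {
                    state = 'ground'; // malformed sequence ends, nothing dispatched
                }
                break;
        }
    }
    flush();
    return tokens;
}
```

For cursor handling, only csi tokens whose final byte is one of the cursor-movement letters and execute tokens for newline (code 10) would need real handlers; everything else can be ignored.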

Thanks for the answer, Will try use those links as reference, also I don't need to support everything since I don't need full VT100 emulator. I'm trying to parse only cursor movement to show some ANSI art and in this case plots. That state diagram probably is what I need to look at. — jcubic, Dec 29 '19 at 16:38
Yeah - you prolly dont need the whole thing just for cursor movements, still quite alot to correctly deal with error recovery and such. A pure regexp based solution will always fail in this regard (still might be possible to work in 80%, so maybe go with the 80/20 rule). — jerch, Dec 29 '19 at 16:42
I've edited the question with your node parser, I have the same issue with cursor it should render just ASCII but there is blank space below the header and plot is shifted. — jcubic, Dec 29 '19 at 17:42
Could you open an issue in your repo and point me to it? This way we dont have to pollute SF with forth and back fixes, which just makes the question unreadable. For the question at hand - I would try to "step debug" the sequences to see which one inserts the additional space/line. Either the handling of that one is already the culprit, or the cursor state was already wrong by a previous sequence, that got wrongly handled. — jerch, Dec 29 '19 at 18:04
