
When tokenizing in Superpower, how do I match a string only if it is the first thing in a line (note: this is a different question than this one)?

For example, assume I have a language with only the following four characters (' ', ':', 'X', 'Y'), each of which is a token. There is also a 'Header' token that captures the regex pattern /^[XY]+:/ (any number of Xs and Ys followed by a colon, but only when they start the line).
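To make the intended behavior concrete, the pattern can be checked with a plain .NET regex first (a standalone sketch, independent of Superpower; `RegexOptions.Multiline` makes `^` anchor at the start of every line, which is what the Header token should emulate):

```csharp
using System;
using System.Text.RegularExpressions;

class HeaderPatternDemo
{
    static void Main()
    {
        var header = new Regex(@"^[XY]+:", RegexOptions.Multiline);

        Console.WriteLine(header.IsMatch("XY:"));    // matches at the start of the line
        Console.WriteLine(header.IsMatch("X XY:"));  // no match: "XY:" is mid-line
        Console.WriteLine(header.IsMatch("X\nXY:")); // matches at the start of the second line
    }
}
```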

Here is a quick class for testing (the 4th test-case fails):

using System;
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;

public enum Tokens { Space, Colon, Header, X, Y }

public class XYTokenizer
{
    static void Main(string[] args)
    {
        Test("X", Tokens.X);
        Test("XY", Tokens.X, Tokens.Y);
        Test("X Y:", Tokens.X, Tokens.Space, Tokens.Y, Tokens.Colon);
        Test("X: X", Tokens.Header, Tokens.Space, Tokens.X);
    }

    public static readonly Tokenizer<Tokens> tokenizer = new TokenizerBuilder<Tokens>()
        .Match(Character.EqualTo('X'), Tokens.X)
        .Match(Character.EqualTo('Y'), Tokens.Y)
        .Match(Character.EqualTo(':'), Tokens.Colon)
        .Match(Character.EqualTo(' '), Tokens.Space)
        .Build();

    static void Test(string input, params Tokens[] expected)
    {
        var tokens = tokenizer.Tokenize(input);
        var i = 0;
        foreach (var t in tokens)
        {
            // guard against more tokens than expected before indexing
            if (i >= expected.Length || t.Kind != expected[i])
            {
                Console.WriteLine("tokens[" + i + "] was Tokens." + t.Kind
                    + (i < expected.Length ? " not Tokens." + expected[i] : " (unexpected extra token)")
                    + " for '" + input + "'");
                return;
            }
            i++;
        }
        if (i < expected.Length)
        {
            Console.WriteLine("expected " + expected.Length + " tokens but got only " + i + " for '" + input + "'");
            return;
        }
        Console.WriteLine("OK");
    }
}
  • You'd probably have to build a custom tokenizer that doesn't use the `TokenizerBuilder`. You have more control of how the tokens are parsed when you build your own tokenizer. – jtate Oct 28 '18 at 22:02
  • yes, that is what I thought – rednoyz Oct 29 '18 at 17:34

1 Answer

I came up with a custom Tokenizer based on the example found here. I added comments throughout the code so you can follow what's happening.

using System.Collections.Generic;
using System.Linq;
using Superpower;
using Superpower.Model;

// Assumes the question's Tokens enum is extended with an ABC member:
// public enum Tokens { Space, Colon, Header, X, Y, ABC }
public class MyTokenizer : Tokenizer<Tokens>
{
    protected override IEnumerable<Result<Tokens>> Tokenize(TextSpan input)
    {
        Result<char> next = input.ConsumeChar();

        bool checkForHeader = true;

        while (next.HasValue)
        {
            // need to check for a header when starting a new line
            if (checkForHeader)
            {
                var headerStartLocation = next.Location;
                var tokenQueue = new List<Result<Tokens>>();
                while (next.HasValue && (next.Value == 'X' || next.Value == 'Y'))
                {
                    tokenQueue.Add(Result.Value(next.Value == 'X' ? Tokens.X : Tokens.Y, next.Location, next.Remainder));
                    next = next.Remainder.ConsumeChar();
                }

                // only if we had at least one X or one Y
                if (tokenQueue.Any())
                {
                    if (next.HasValue && next.Value == ':')
                    {
                        // this is a header token; we have to return a Result of the start location 
                        // along with the remainder at this location
                        yield return Result.Value(Tokens.Header, headerStartLocation, next.Remainder);
                        next = next.Remainder.ConsumeChar();
                    }
                    else
                    {
                        // this isn't a header; we have to return all the tokens we parsed up to this point
                        foreach (Result<Tokens> tokenResult in tokenQueue)
                        {
                            yield return tokenResult;
                        }
                    }
                }

                if (!next.HasValue)
                    yield break;
            }

            checkForHeader = false;

            if (next.Value == '\r') 
            {
                // skip over the carriage return
                next = next.Remainder.ConsumeChar();
                continue;
            }

            if (next.Value == '\n')
            {
                // line break; check for a header token here
                next = next.Remainder.ConsumeChar();
                checkForHeader = true;
                continue;
            }

            if (next.Value == 'A')
            {
                var abcStart = next.Location;
                next = next.Remainder.ConsumeChar();
                if (next.HasValue && next.Value == 'B')
                {
                    next = next.Remainder.ConsumeChar();
                    if (next.HasValue && next.Value == 'C')
                    {
                        yield return Result.Value(Tokens.ABC, abcStart, next.Remainder);
                        next = next.Remainder.ConsumeChar();
                    }
                    else
                    {
                        // guard: next may be past the end of input here
                        yield return Result.Empty<Tokens>(next.Location, next.HasValue
                            ? $"unrecognized `AB{next.Value}`"
                            : "unexpected end of input after `AB`");
                    }
                }
                else
                {
                    // guard: next may be past the end of input here
                    yield return Result.Empty<Tokens>(next.Location, next.HasValue
                        ? $"unrecognized `A{next.Value}`"
                        : "unexpected end of input after `A`");
                }
            }
            else if (next.Value == 'X')
            {
                yield return Result.Value(Tokens.X, next.Location, next.Remainder);
                next = next.Remainder.ConsumeChar();
            }
            else if (next.Value == 'Y')
            {
                yield return Result.Value(Tokens.Y, next.Location, next.Remainder);
                next = next.Remainder.ConsumeChar();
            }
            else if (next.Value == ':')
            {
                yield return Result.Value(Tokens.Colon, next.Location, next.Remainder);
                next = next.Remainder.ConsumeChar();
            }
            else if (next.Value == ' ')
            {
                yield return Result.Value(Tokens.Space, next.Location, next.Remainder);
                next = next.Remainder.ConsumeChar();
            }
            else
            {
                yield return Result.Empty<Tokens>(next.Location, $"unrecognized `{next.Value}`");
                next = next.Remainder.ConsumeChar(); // Skip the character anyway
            }
        }
    }
}

And you can call it like this:

var tokens = new MyTokenizer().Tokenize(input);
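For instance, reusing the Test helper from the question (an untested sketch: Test would need to call new MyTokenizer().Tokenize(input) instead of the static builder tokenizer, and the Tokens enum needs the ABC member):

```csharp
// Rerunning the question's failing case, plus the multi-character
// ABC cases discussed in the comments below.
Test("X", Tokens.X);
Test("X: X", Tokens.Header, Tokens.Space, Tokens.X);
Test("ABC Y:", Tokens.ABC, Tokens.Space, Tokens.Y, Tokens.Colon);
Test("X: ABC", Tokens.Header, Tokens.Space, Tokens.ABC);
```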
  • thanks for this, i'll test it when I'm back at my desk (lot of code for a seemingly simple thing though) – rednoyz Oct 31 '18 at 17:32
  • well, when building a custom tokenizer you have to write out everything, even the simple cases. I'm sure you could simplify some parts of it by adding some helper classes/functions, but regardless, it should accomplish what you need. – jtate Oct 31 '18 at 17:46
  • yes, this is great (thanks) but I realize that in my toy example I left out one important case, where the token is comprised of multiple characters: How to check, in a custom tokenizer, beyond the next character? For example, lets say there was one additional token for "ABC"... With tests: Test("ABC Y:", Tokens.ABC, Tokens.Space, Tokens.Y, Tokens.Colon); AND Test("X: ABC", Tokens.Header, Tokens.Space, Tokens.ABC); – rednoyz Nov 02 '18 at 16:23
  • Can `ABC` be a header? i.e. `Test("ABC: X", Tokens.Header, Tokens.X)`. And are `A`, `B`, and `C` each tokens by themselves? – jtate Nov 02 '18 at 17:49
  • For the sake of simplicity in this example, I think we can consider ABC a single token (no A,B,C tokens) that cannot be a header. – rednoyz Nov 03 '18 at 03:24
  • @rednoyz check my edit. It's possible that could be simplified, but writing it out like that gives you the ability to provide very specific error messages if the tokenizer fails. – jtate Nov 03 '18 at 04:36
  • Seems ok for one case (and I do hear what you say about error messages), but if you have 8-9 (or 20) of those tokens, that is some serious spaghetti. What I keep coming back to is the idea of first running a regular LINQ-style tokenizer, then in a 2nd step combining some of those tokens if needed. I've started on such an approach, but I can't figure how to combine 2 or more tokens in a existing (or new) TokenList while still properly maintaining the state for each – rednoyz Nov 03 '18 at 05:48
  • Yes you can definitely clean it up by creating helper classes and methods if need be, but your question was how to parse the tokens provided, which I've shown you how to do. Feel free to expand upon it on your own. – jtate Nov 03 '18 at 05:51
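As the comments suggest, the nested if blocks for multi-character tokens can be factored into a helper. One possible (untested) sketch, built only from the same TextSpan/Result primitives the answer already uses — TryMatchLiteral is a hypothetical name, not a Superpower API:

```csharp
// Hypothetical helper: try to match a fixed literal starting at `input`.
// On success, `remainder` points just past the literal; on failure it is
// rewound to the starting position.
static bool TryMatchLiteral(TextSpan input, string literal, out TextSpan remainder)
{
    remainder = input;
    foreach (var ch in literal)
    {
        var r = remainder.ConsumeChar();
        if (!r.HasValue || r.Value != ch)
        {
            remainder = input;
            return false;
        }
        remainder = r.Remainder;
    }
    return true;
}
```

With that, the 'A' branch in Tokenize could shrink to roughly: if (TryMatchLiteral(next.Location, "ABC", out var rest)) { yield return Result.Value(Tokens.ABC, next.Location, rest); next = rest.ConsumeChar(); }, and each additional keyword becomes one line instead of another nest of ifs.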