Superpower: Match any non-whitespace character except for tokenized values

Question

I would like to use the NuGet package Superpower to match all non-whitespace characters, except where they form a tokenized value, which should be matched separately. For example:

    var s = "some random text{variable}";

Should result in:

    ["some", "random", "text", "variable"]

But what I have now is:

    ["some", "random", "text{variable}"]

The parsers for it look like:

    public static class TextParser
    {
        public static TextParser<string> EncodedContent =>
            from open in Character.EqualTo('{')
            from chars in Character.Except('}').Many()
            from close in Character.EqualTo('}')
            select new string(chars);

        public static TextParser<string> HtmlContent =>
            from content in Span.NonWhiteSpace
            select content.ToString();
    }

Of course, in the real parser I return the strings in another variable; this is just simplified.

Hopefully that is enough information. If not, I do have the whole repo up on GitHub: https://github.com/jon49/FlowSharpHtml

Why not simply replace `{` and `}` with spaces and proceed afterwards? – Patrick Artner, Jul 04 '19 at 18:40
I want to use a parsing engine to do it, and it is more complicated than what the example shows. The example is simplified compared to what I am really doing :-) – Jon49, Jul 04 '19 at 19:11

score 2 · Accepted Answer · answered Jul 09 '19 at 17:56

There could be many different ways to parse your input, and depending on how much more complex your inputs really are (as you say you've simplified it), you will probably need to tweak this. But the best way to approach using Superpower is to create small parsers and then build upon them. See my parsers and their descriptions below (each one building upon the previous):

    // Requires: using System.Linq; using Superpower; using Superpower.Parsers;

    /// <summary>
    /// Parses any character other than whitespace or brackets.
    /// </summary>
    public static TextParser<char> NonWhiteSpaceOrBracket =>
        from c in Character.Except(ch =>
            char.IsWhiteSpace(ch) || ch == '{' || ch == '}',
            "Anything other than whitespace or brackets"
        )
        select c;

    /// <summary>
    /// Parses any piece of valid text, i.e. any text other than whitespace or brackets.
    /// </summary>
    public static TextParser<string> TextContent =>
        from content in NonWhiteSpaceOrBracket.Many()
        select new string(content);

    /// <summary>
    /// Parses an encoded piece of text enclosed in brackets.
    /// </summary>
    public static TextParser<string> EncodedContent =>
        from open in Character.EqualTo('{')
        from text in TextContent
        from close in Character.EqualTo('}')
        select text;

    /// <summary>
    /// Parses a single piece of content, e.g. "name{variable}" or just "name".
    /// </summary>
    public static TextParser<string[]> Content =>
        from text in TextContent
        from encoded in EncodedContent.OptionalOrDefault()
        select encoded != null ? new[] { text, encoded } : new[] { text };

    /// <summary>
    /// Parses multiple pieces of content and flattens the result.
    /// </summary>
    public static TextParser<string[]> AllContent =>
        from content in Content.ManyDelimitedBy(Span.WhiteSpace)
        select content.SelectMany(x => x).ToArray();

Then to run it:

    string input = "some random text{variable}";
    var result = AllContent.Parse(input);

Which outputs:

    ["some", "random", "text", "variable"]

The idea here is to build a parser for a single piece of content, then leverage Superpower's built-in ManyDelimitedBy combinator to simulate a "split" on the whitespace between the pieces of content you want to parse out. This results in an array of "content" pieces.
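
To see the "split" behaviour of ManyDelimitedBy in isolation, here is a minimal sketch that leaves the braces out entirely. The class and member names (`SplitSketch`, `Word`, `Words`) are made up for illustration and are not part of the parsers above.

```csharp
using Superpower;
using Superpower.Parsers;

// Hypothetical helper class, only here to illustrate ManyDelimitedBy.
public static class SplitSketch
{
    // One word: a run of non-whitespace characters, converted to a string.
    static TextParser<string> Word =>
        Span.NonWhiteSpace.Select(span => span.ToStringValue());

    // Words separated by whitespace, similar in spirit to splitting on whitespace.
    public static TextParser<string[]> Words =>
        Word.ManyDelimitedBy(Span.WhiteSpace);
}
```

Calling `SplitSketch.Words.Parse("some random text")` should give back the three words as an array. The `Content` parser plays the role of `Word` in the full solution, which is why each delimited item can itself produce more than one string.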

You may also want to take advantage of Superpower's token functionality to produce better error messages when parsing fails. It's a slightly different approach; take a look at this blog post to read more about how to use the tokenizer. It's completely optional if you don't need friendlier error messages.
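
As a rough idea of what the token-based approach could look like, here is a sketch built on Superpower's `TokenizerBuilder`. The `ContentToken` enum and the `TokenizedContent` class are assumptions made up for this example, not something taken from the question or the blog post, and the grammar is only the simplified one from the question.

```csharp
using Superpower;
using Superpower.Parsers;
using Superpower.Tokenizers;

// Hypothetical token kinds for this sketch.
public enum ContentToken { Text, OpenBrace, CloseBrace }

public static class TokenizedContent
{
    // Step 1: cut the input into tokens, discarding the whitespace between them.
    public static Tokenizer<ContentToken> Instance { get; } =
        new TokenizerBuilder<ContentToken>()
            .Ignore(Span.WhiteSpace)
            .Match(Character.EqualTo('{'), ContentToken.OpenBrace)
            .Match(Character.EqualTo('}'), ContentToken.CloseBrace)
            .Match(
                Character.Except(
                    ch => char.IsWhiteSpace(ch) || ch == '{' || ch == '}',
                    "whitespace or braces").AtLeastOnce(),
                ContentToken.Text)
            .Build();

    // Step 2: parse the token list. A plain piece of text...
    static TokenListParser<ContentToken, string> Text =>
        Token.EqualTo(ContentToken.Text).Select(t => t.ToStringValue());

    // ...or a piece of text wrapped in braces, e.g. {variable}.
    static TokenListParser<ContentToken, string> Encoded =>
        from open in Token.EqualTo(ContentToken.OpenBrace)
        from text in Text
        from close in Token.EqualTo(ContentToken.CloseBrace)
        select text;

    // Any number of plain or encoded pieces, up to the end of the input.
    public static TokenListParser<ContentToken, string[]> AllContent =>
        Text.Or(Encoded).Many().AtEnd();

    public static string[] Parse(string input) =>
        AllContent.Parse(Instance.Tokenize(input));
}
```

With this sketch, `TokenizedContent.Parse("some random text{variable}")` is intended to return the same four strings as the character-level parsers. One difference to be aware of: because whitespace is dropped during tokenization, `text {variable}` and `text{variable}` are treated the same way here.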

score 0 · Answer 2 · Mitja · 2019-07-04T20:28:23.323

Maybe it can be written more simply, but this was the first idea I had. I hope it helps:

    // Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
    Regex tokenizerRegex = new Regex(@"\{(.+?)\}");
    var s = "some random text{variable}";
    string[] splitted = s.Split(' ');
    List<string> result = new List<string>();
    foreach (string word in splitted)
    {
        if (tokenizerRegex.IsMatch(word)) // a tokenized value was recognized
        {
            int nextIndex = 0;
            foreach (Match match in tokenizerRegex.Matches(word)) // loop through all matches
            {
                if (nextIndex < match.Index) // text before the first token or between two tokens
                    result.Add(word.Substring(nextIndex, match.Index - nextIndex));
                result.Add(match.Value);
                nextIndex = match.Index + match.Length; // save the end position of the token
            }
            if (nextIndex < word.Length) // text after the last token
                result.Add(word.Substring(nextIndex));
        }
        else
            result.Add(word); // no token found, just add the word
    }
    Console.WriteLine("[\"{0}\"]", string.Join("\", \"", result));

Examples

Text: some random text{variable}

    ["some", "random", "text", "{variable}"]

Text: some random text{variable}{next}

    ["some", "random", "text", "{variable}", "{next}"]

Text: some random text{variable}and{next}

    ["some", "random", "text", "{variable}", "and", "{next}"]

edited Jul 04 '19 at 20:28

answered Jul 04 '19 at 19:56

Mitja


Thanks for your thoughtful answer, but I am really looking for an answer with a formal parser. As the complexity of the code increases, it becomes more and more important to use formal processes like parser libraries to parse text. The question that I posed is a simplification of what I need to parse the text for. It is more complex than that and will continue to grow. I'm really looking for a solution that will use the Superpower library. – Jon49 Jul 05 '19 at 05:14
Updated my question to make it more explicit that I'm looking for a solution using the NuGet package Superpower to solve this problem. Thanks again! – Jon49 Jul 05 '19 at 05:17
Sorry, never worked with Superpower. You shouldn't simplify your case on SO. Hope someone can help. – Mitja Jul 06 '19 at 06:05
It's standard to simplify on SO, and best practice for asking questions and approaching hard problems. Otherwise you would need a whole book just to show what you need! Yeah, people yell at you for not simplifying :-) – Jon49 Jul 07 '19 at 07:16
It was related to your comment on your initial post, which reads like "I ask a question and am doing something different", and was not meant to say that you should post your whole code. Never mind, I hope someone can answer your question. Maybe you could ask the creator of the NuGet package. – Mitja Jul 08 '19 at 08:18
