When parsing with Superpower, how can I match a string only if it is the first thing on a line?

For example, I need to match the "A:" in "A: Hello Goodbye\n" but not in "Goodbye A: Hello\n".

rednoyz
  • are you trying to parse multiple lines of text like "A: Hello Goodbye" ? And what is your expected output? Key/value pairs e.g. `Key = "A"` and `Value = "Hello Goodbye"` ? Also, do you expect "Goodbye A: Hello" to fail parsing? – jtate Oct 22 '18 at 15:29
  • I guess that depends on whether it's the tokenizer or the parser. If it's the tokenizer (which I think is the better solution), then I'd want anything that matches the above regex to be a token. – rednoyz Oct 22 '18 at 15:37
  • It really depends on your expected output. What data are you trying to extract out of this? – jtate Oct 22 '18 at 15:39
  • By way of context, each command in the language is a single line (ended by a line-break), and certain characters/strings have special meaning if they start the line, but not if they occur later. So if it happens in the parser, then it might return an Actor object which contains the string "A:", followed by a FreeText object which contains the string "Hello Goodbye". In the second case, the whole thing would be FreeText("Goodbye A: Hello") since the Actor parser would fail. – rednoyz Oct 22 '18 at 15:41
  • I think I understand, but to build a parser like this, you'd need to provide a more comprehensive example. Could you update the question to include that, along with the classes you'd want the output parsed into? – jtate Oct 22 '18 at 15:48
  • Sure, I can add that tomorrow (I was imagining those two classes, Actor and FreeText, to each have only a single string member var). But you think it is not possible in the tokenizer? – rednoyz Oct 22 '18 at 15:57
  • it's definitely possible, it's just a different approach if you want to have it as a token vs parsing it – jtate Oct 22 '18 at 16:04
  • as mentioned, this would ideally happen in the tokenizer – rednoyz Oct 22 '18 at 16:12
  • just curious, why do you need this done in the tokenizer? – jtate Oct 22 '18 at 20:38
  • just seemed cleaner to me, but I'm open to either – rednoyz Oct 23 '18 at 00:30
  • Check the parser version [here](https://gist.github.com/dhowe/0bd17c5a7658ebbc817a1ee5a89aeb19) – rednoyz Oct 23 '18 at 13:13
  • It looks like you're moving away from the token approach and going with a parser. What issues are you having now? – jtate Oct 23 '18 at 14:17

3 Answers

Using your example here, I would change your ActorParser and NodeParser definitions to this:

public readonly static TokenListParser<Tokens, Node> ActorParser =
    from name in NameParser
    from colon in Token.EqualTo(Tokens.Colon)
    from text in TextParser
    select new Node {
        Actor = name + colon.ToStringValue(),
        Text = text
    };

public readonly static TokenListParser<Tokens, Node> NodeParser =
    from node in ActorParser.Try()
        .Or(TextParser.Select(text => new Node { Text = text }))
    select node;

This may be a bug in Superpower: I'm not sure why the first parser in NodeParser needs a Try() when it is chained with Or(), but parsing throws an error without it.
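As far as I understand Superpower's documented behaviour, Or() only falls through to its second parser when the first one fails without consuming any input, and Try() is what allows backtracking after a partial match, which would explain the error. The sketch below illustrates this with character-level parsers rather than the token parsers above. The `Line` class, the `Rest` helper and the use of `Identifier.CStyle` for the name are assumptions made for this illustration and are not taken from the gist.

```csharp
using System;
using Superpower;
using Superpower.Parsers;

// Hypothetical result type for this sketch (stands in for Node above).
public sealed class Line
{
    public string Actor { get; set; }
    public string Text { get; set; }
}

public static class TryOrSketch
{
    // Everything up to the end of the input, as a string.
    static readonly TextParser<string> Rest =
        Character.AnyChar.Many().Select(chars => new string(chars));

    // A name, a colon, then the remaining text, e.g. "A: Hello Goodbye".
    static readonly TextParser<Line> ActorLine =
        from name in Identifier.CStyle
        from colon in Character.EqualTo(':')
        from text in Rest
        select new Line { Actor = name.ToStringValue() + colon, Text = text };

    // Fallback: the whole input is free text.
    static readonly TextParser<Line> PlainLine =
        Rest.Select(text => new Line { Text = text });

    // Same alternatives, with and without backtracking on the first one.
    public static readonly TextParser<Line> WithoutTry = ActorLine.Or(PlainLine);
    public static readonly TextParser<Line> WithTry = ActorLine.Try().Or(PlainLine);

    public static void Main()
    {
        // ActorLine consumes "Goodbye" before failing on the missing colon.
        const string input = "Goodbye A: Hello";
        Console.WriteLine(WithoutTry.TryParse(input).HasValue);
        Console.WriteLine(WithTry.TryParse(input).HasValue);
    }
}
```

If Or() behaves as described, the first parse should fail because ActorLine has already consumed "Goodbye", while the second should succeed and treat the whole line as free text.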

Also, your validation when checking input[1] is incorrect (probably just a copy-paste issue). It should be checking against "Goodbye A: Hello" and not "Hello A: Goodbye".

jtate
  • Thanks for the update. I'm accepting this though what I realize I really need is the tokenizer version, which I've [posted here](https://stackoverflow.com/questions/53029386/superpower-match-a-string-with-tokenizer-only-if-it-begins-a-line) along with test-cases... – rednoyz Oct 28 '18 at 07:46

Unless RegexOptions.Multiline is set, ^ matches only at the beginning of the whole string, not at the beginning of each line.

You can probably use inline (?m) to turn on multiline:

static TextParser<Unit> Actor { get; } =
  from start in Span.Regex(@"(?m)^[A-Za-z][A-Za-z0-9_]*:")
  select Unit.Value;
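
To see the effect of the multiline flag in isolation, here is a small sketch using plain .NET `Regex` rather than Superpower. The `CountMatches` helper and the sample input are made up for this illustration. Note that `Span.Regex` applies the pattern from the parser's current position, so results inside a tokenizer may differ from this standalone case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineAnchorSketch
{
    // Counts how many times the actor pattern matches in the input.
    public static int CountMatches(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^[A-Za-z][A-Za-z0-9_]*:", options).Count;
    }

    public static void Main()
    {
        const string input = "A: Hello\nB: Goodbye\nGoodbye C: Hello\n";

        // Without Multiline, ^ anchors only at the start of the string.
        Console.WriteLine(CountMatches(input, RegexOptions.None));      // prints 1

        // With Multiline, ^ also anchors after each line break,
        // so "A:" and "B:" match but the mid-line "C:" does not.
        Console.WriteLine(CountMatches(input, RegexOptions.Multiline)); // prints 2
    }
}
```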
Craig.Feied
  • passing the RegexOptions.Multiline option doesn't fix the problem: Span.Regex(@"^[A-Za-z][A-Za-z0-9_]*:", RegexOptions.Multiline) – rednoyz Oct 08 '18 at 18:21
  • Hmmm -- if multiline doesn't solve it then most likely the `Span` you are receiving is a slice that's not what you think it is (doesn't correspond to a line). Try breaking on your code and inspect the span. If that doesn't solve your problem, then post a minimal working example that demonstrates the failure, so we can run it and help you sort out the problem. – Craig.Feied Oct 08 '18 at 19:42
  • Ok, so seems that if the line is "1 abc:" and Ignore(Span.WhiteSpace) is set, then the tokenizer consumes the first token ('1'), then ignores the white space as directed, then sees the "abc:" as starting from position 0, thus matching. But what I want is to only match "abc:" if it is the first token ... How to do this? – rednoyz Oct 09 '18 at 14:32
  • You can't do that from *inside* the tokenizer because it only sees the remainder after previous tokens have been processed. Probably it would help if you explained more of what you are trying to do at a higher level, with an example of the full input you are expecting and the exact behavior you want to accomplish. The act of tokenizing breaks the input into multiple tokens based on rules; if you want to select a particular token you would do so after the tokenizer is done. – Craig.Feied Oct 09 '18 at 16:09
  • Can you post a minimal demonstration program that compiles and executes to exhibit the behavior you are describing? – Craig.Feied Oct 09 '18 at 17:18
  • I suppose I can do it in the parser equally well, but it's still not clear to me how to do so. Each command in the language in question is a single line (ended by a line-break), and certain characters/strings have special meaning if they start the line, but not if they occur later. For example, I need a parser that will match the A colon in "A: Hello Goodbye\n" but not in "Goodbye A: Hello\n". – rednoyz Oct 09 '18 at 19:08
  • see [this question](https://stackoverflow.com/questions/53029386/superpower-match-a-string-with-tokenizer-only-if-it-begins-a-line) for the tokenizer case, which I still think should be possible – rednoyz Oct 28 '18 at 07:48

I have actually done something similar, but I do not use a Tokenizer.

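// Holds the matched key between the two parsing steps (shared static state, so not thread-safe).
private static string _keyPlaceholder;

private static TextParser<MyClass> Actor { get; } =
    Span.Regex("^[A-Za-z][A-Za-z0-9_]*:")
        .Then(x =>
             {
                 // Remember the key, then consume the rest of the input as the value.
                 _keyPlaceholder = x.ToStringValue();
                 return Character.AnyChar.Many();
             })
        .Select(value => new MyClass { Key = _keyPlaceholder, Value = new string(value) });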

I have not tested this; I just wrote it out from memory. For the input "A: Hello Goodbye", the above parser should produce the following:

myClass.Key = "A:"
myClass.Value = " Hello Goodbye"
HuntK24