
Motivating examples
good:

 SELECT   a, b,  c    d,e FROM t1

bad:

SE L ECT a, b, c    d,e FR OM t1

SELECTa, b, c    d,eFROMt1

So as you can see, the problem here is that some spaces are fine (between SELECT and a, b, c for example), some are bad (SE L ECT), and some are necessary (before/after a keyword).

So my question is: what idioms should I use here? If I use a space skipper with phrase_parse it will allow the bad spaces, and if I want to allow the good spaces without a skipper, the parsers become littered with *char_(' ').
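To make the dilemma concrete, here is a minimal sketch of the skipper-based version, assuming a toy grammar with qi::space as the skipper (identifiers are wrapped in lexeme[] and the alias spelling from the first example is left out to keep it short). The keywords are plain lit() strings, so the whitespace around them is merely optional and the last bad input slips through:

    // Toy sketch: with qi::space as the skipper, whitespace between tokens is
    // optional rather than required, so "SELECTa ... FROMt1" is accepted.
    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;

    int main() {
        for (std::string const input : {
                 "SELECT   a, b,  c, d, e FROM t1", // good
                 "SE L ECT a, b, c FROM t1",        // bad (this one does fail)
                 "SELECTa, b, c FROMt1" })          // bad, but accepted anyway
        {
            auto f = input.begin(), l = input.end();
            bool ok = qi::phrase_parse(f, l,
                          qi::lit("SELECT")
                              >> qi::lexeme[+(qi::alnum | qi::char_('_'))] % ','
                              >> "FROM" >> qi::lexeme[+(qi::alnum | qi::char_('_'))],
                          qi::space)
                      && f == l;
            std::cout << (ok ? "accepted: " : "rejected: ") << input << "\n";
        }
    }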

NoSenseEtAl
  • Sounds to me like what you're looking for is a regex; or better yet, a check of the result of your SQL query? – UKMonkey Nov 03 '16 at 10:49
  • This is mostly about parsing with Boost Spirit; this is just a toy example, and real problems of this kind cannot be nicely solved with a regex – NoSenseEtAl Nov 03 '16 at 10:50
  • 2
    If you provide a logical definition what "good" or "bad" space means, as a technical specification, then it can be translated directly into code. But vague, nebulous concepts like "good" or "bad" is something that's not directly translatable into software. – Sam Varshavchik Nov 03 '16 at 11:01

2 Answers


You need to mark your keywords as qi::lexeme[].

Besides, you probably want something like boost::spirit::repository::qi::distinct to avoid parsing SELECT2 as SELECT followed by 2.

See e.g.
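a minimal sketch of how the two fit together, assuming a toy SELECT grammar with a qi::space skipper (the rule names and the optional column alias are illustrative):

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/repository/include/qi_distinct.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;
    using boost::spirit::repository::qi::distinct;

    int main() {
        using It = std::string::const_iterator;
        qi::rule<It, qi::space_type> kw_select, kw_from, ident, field, select_statement;

        // distinct(...) only accepts the keyword when it is NOT immediately
        // followed by another identifier character, so "SELECTa"/"SELECT2"
        // no longer pass as SELECT.
        kw_select = distinct(qi::alnum | '_')["SELECT"];
        kw_from   = distinct(qi::alnum | '_')["FROM"];

        // lexeme[] switches the skipper off inside, so an identifier cannot
        // contain embedded spaces; the not-predicate keeps keywords from
        // being swallowed as identifiers or aliases.
        ident = !(kw_select | kw_from)
              >> qi::lexeme[(qi::alpha | qi::char_('_')) >> *(qi::alnum | qi::char_('_'))];

        field            = ident >> -ident; // column with optional alias, e.g. "c    d"
        select_statement = kw_select >> (field % ',') >> kw_from >> ident;

        for (std::string const input : {
                 "SELECT   a, b,  c    d,e FROM t1", // good
                 "SE L ECT a, b, c    d,e FR OM t1", // bad
                 "SELECTa, b, c    d,eFROMt1" })     // bad
        {
            It f = input.begin(), l = input.end();
            bool ok = qi::phrase_parse(f, l, select_statement, qi::space) && f == l;
            std::cout << (ok ? "accepted: " : "rejected: ") << input << "\n";
        }
    }

The charset handed to distinct spells out exactly which characters may not follow a keyword, which is what separates SELECT a from SELECTa.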

sehe
  • Nice answer, and just to give some more info: I think in my example this quote from your other answer applies: "If you're building a really robust general-purpose language grammar, this is about the point where you should consider using a Spirit Lexer." Rage in a comment reply if I am wrong. :) – NoSenseEtAl Nov 04 '16 at 06:20

What you're looking for is, well, parsing.

It's not about accepting/rejecting "good" or "bad" spaces. It is about trying to recognize what's entered, and rejecting it if you can't.

In this case, let's start with a (thoroughly simplified) grammar for the statement in question:

select_statement ::= 'select' field_list 'from' table

So, you read in the first token. If it's SE or SELECTa, you reject the statement as invalid, because neither of those fits your grammar. Almost any decent parser generator (including, but certainly not limited to, Spirit) makes this fairly trivial: you specify what is acceptable and what to do when the input is not acceptable, and it takes care of invoking that handling for input that doesn't fit the specified grammar.

As for how you do the tokenization to start with, it's typically pretty simple, and usually can be based on regular expressions (e.g., many languages have been implemented using lex and derivatives like Flex, which use regexen to specify tokenization).

For something like this, you directly specify the keywords for your language, so you'd have something that says when it matches 'select', it should return that as a token. Then you have something more general for an identifier, which typically runs something like `[_a-zA-Z][_a-zA-Z0-9]*` ("an identifier starts with an underscore or letter, followed by an arbitrary number of underscores, letters, or digits"). In the cases above, this would be entirely sufficient to find and return "SE" and "SELECTa" as the first tokens in the "bad" examples.
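As a rough sketch of that tokenization step (hand-rolled with std::regex rather than lex/Flex; the token kinds and the two keywords are just placeholders):

    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    struct Token { std::string kind, text; };

    std::vector<Token> tokenize(std::string const& input) {
        // identifier: underscore or letter, then underscores, letters, digits
        static std::regex const ident_re("[_a-zA-Z][_a-zA-Z0-9]*");
        static std::regex const ws_re("\\s+");

        std::vector<Token> tokens;
        auto it = input.cbegin();
        while (it != input.cend()) {
            std::smatch m;
            if (std::regex_search(it, input.cend(), m, ws_re,
                                  std::regex_constants::match_continuous)) {
                it = m[0].second;                              // skip whitespace
            } else if (std::regex_search(it, input.cend(), m, ident_re,
                                         std::regex_constants::match_continuous)) {
                std::string text = m.str();
                // a keyword is just an identifier the tokenizer singles out
                std::string kind = (text == "SELECT" || text == "FROM")
                                       ? "keyword" : "identifier";
                tokens.push_back({kind, text});
                it = m[0].second;
            } else {
                tokens.push_back({"punct", std::string(1, *it)}); // ',' etc.
                ++it;
            }
        }
        return tokens;
    }

    int main() {
        // "SELECTa" comes back as one identifier token, not the SELECT keyword
        for (Token const& t : tokenize("SELECTa, b, c FROM t1"))
            std::cout << t.kind << ": " << t.text << "\n";
    }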

Your parser would then detect that the first thing it received was an identifier instead of a keyword, at which point the statement would (presumably) be rejected.

Jerry Coffin
  • One of the few cases where "you can use a regular expression to solve your problem" does not leave you with two problems, because nobody sane uses an insane pattern for "identifier". – Yakk - Adam Nevraumont Nov 03 '16 at 11:34
  • 1
    The spirit translation of this approach would be to use a Spirit Lex tokenizer. I won't recommend it as it complicates use of Spirit to the point I don't think it hits a sweet spot any more – sehe Nov 03 '16 at 13:24