Flex match string literal, escaping line feed

Question

I am using flex to try and match C-like, simplified string literals. A regular expression as such:

\"([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})*\"

will match all one-line string literals I am interested in.

A string literal cannot contain a non-escaped backslash. A string literal also cannot contain a literal line feed (0x0a) unless it is escaped by a backslash, in which case the line feed and any following spaces and tabulations are ignored..

For example, assuming {LF} is an actual line feed and {TAB} an actual tabulation (I could not format it better than that).

In: "This is an example \{LF}{TAB}{TAB}{TAB}of a confusing valid string"

Token: "This is an example of a confusing valid string"

My first idea was to use a starting state, a trailing context and yymore() to match what I want and check for errors giving something like the following:

...
%%
\" { BEGIN STRING; yymore(); }
<STRING>{
    \n { /* ERROR HERE! */ }
    <<EOF>> { /* ERROR HERE AS WELL */ }
    
    ([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})* { 
          /* String ok up to here*/ 
          yymore(); 
    }
    \\\n[ \t]* { 
         /*Vadid inside a tring but needs to be ignored! */ 
         yymore(); 
    }
    \" { /* Full string matched */ BEGIN INITIAL;}
    .|\n { \* Anything else is considered an error *\  }
}
%%
...

Is there a way to do what I want in the way I am trying to do it? Is there instead any other 'standard' maybe method provided by flex that I just stupidly have not though of? This does not look to me like an uncommon use case. Should I just match the strings separately (beginning to before , after whitespace to end) and concatenate them. This is a bit complicated to do since a string can be decomposed into an arbitrary number of lines using backslashes.

If you really need it to be correct, it might be a good idea take all the features you need from a C language `flex` lexer, assuming the license permits it. There's a couple of edge cases that are tricky. — Neil, Feb 20 '21 at 23:06

rici · Accepted Answer · 2021-02-21T03:41:46.123

If all you want to do is to recognise a string literal, there's no need for start conditions. You can use some variant of the simple pattern which you'll find in many answers:

    ["]({normal}|{escape})*["]

(I used macros to make the structure clear, although in practice I would hardly ever use them.)

"Normal" here means any character without special significance in a string. In other words, any character other than " (which ends the literal), \ (which starts an escape sequence, or newline (which is usually an error although some languages allow newlines in strings). In other words, [^"\n\\] (or something similar).

"escape" would be any valid escape sequence. If you didn't want to validate the escape sequence, you could just match a backslash followed by any single character (including newline): \\(.|\n). But since you do seem to want to validate, you'd need to be explicit about the escape sequences you're prepared for:

    \\([\n\\btnr"]|x[[:xdigit:]]{2})

But all that only recognises valid string literals. Invalid string literals won't match the pattern, and will therefore fall back to whatever you're using as a fallback rule (matching only the initial "). Since that's practically never what you want, you need to add a second rule which detects error. The easiest way to write the second rule is ["]({normal}|{escape})*, i.e. the valid rule without the final double quote. That will only match erroneous string literals because of (f)lex's maximal munch rule: a valid string literal has a longer match with the valid rule than with the error rule (because the valid rule's match includes the final double quote).

In real-life lexical scanners (as opposed to school exercises), it's more common to expect that the lexical scanner will actually resolve the string literal into the actual bytes it represents, by replacing escape sequences with the corresponding character. That is generally done with a start condition, but the individual patterns are more focussed (and there are more of them). For an example of such a parser, you could look at these two answers (and many others):

I have found a solution to my problem and it actually integrates well with what you proposed. What I was trying however was not only to match the escaped sequnce: " ... \{LF}[{SPACE}|{TAB}]* ..." which can be done with your one line, but also ignore LF, SPACE and TAB in the resulting token i.e remove any whitespace from \ up to the next character. This is possible with the start conditions I guess as the two other answers show. You also made me think that using plain C is a bad idea and I should switch to C++. Anyway, thanks! — Desperados, Feb 21 '21 at 15:57
@Desperados: Cool. Yes, if you want to remove splices, you will be better off using a start condition (the same as if you want to change `\n` into a real newline character). (Note: `[[:blank:]]` is a predefined character class containing only space and tab. `[[:space:]]` contains all C whitespace: space, tab, newline, carriage return, form feed, and vertical tab. Using standard character classes is better style than rolling your own IMHO. (I assume C/Posix locale. Locales are permitted to add to the list, but these days single-byte non-unicode locales are pretty rare.) — rici, Feb 21 '21 at 16:43

Flex match string literal, escaping line feed

1 Answers1