59

I'm experimenting to learn flex and would like to match string literals. My code currently looks like:

"\""([^\n\"\\]*(\\[.\n])*)*"\""        {/*matches string-literal*/;}

I've been struggling with variations for an hour or so and can't get it working the way it should. I'm essentially hoping to match a string literal that can't contain a new-line (unless it's escaped) and supports escaped characters.

I am probably just writing a poor regular expression or one incompatible with flex. Please advise!

Lesmana
Thomas
  • Thanks so much everyone! All your comments were very helpful. The regex that finally worked for me is a variant of the one used in the C specification linked by codaddict (and explained by Jonathan): \"(\\(.|\n)|[^\\"\n])*\" – Thomas Jan 11 '10 at 04:12
  • 1
    Since you found Jonathan's answer helpful, consider adding an upvote for his answer. – codaddict Jan 11 '10 at 04:17
  • By the way: nowhere in your question do you specify what language's string literals you're interested in. It's a very good idea to put the language you're asking about in one of the question's tags. – Laurence Gonsalves Jan 11 '10 at 05:14
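
The pattern Thomas settled on can be sanity-checked outside flex. The sketch below uses Python's `re` module as a stand-in, which is an assumption: Python's engine is not flex and the double quote needs no escaping there, but for these inputs the character classes and `.` (which skips newlines by default in both) should behave the same way. The names `STRING_LITERAL` and `is_string_literal` are made up for illustration.

```python
import re

# Thomas's flex pattern \"(\\(.|\n)|[^\\"\n])*\" transcribed for Python's re.
# The double quote is not a metacharacter in Python, so it is left unescaped.
STRING_LITERAL = re.compile(r'"(\\(.|\n)|[^\\"\n])*"')

def is_string_literal(text):
    """Return True if the whole of text is exactly one string literal."""
    return STRING_LITERAL.fullmatch(text) is not None

print(is_string_literal('"hello"'))                  # plain string
print(is_string_literal('"say \\"hi\\""'))           # escaped quotes inside
print(is_string_literal('"line one\\\nline two"'))   # backslash followed by a newline
print(is_string_literal('"line one\nline two"'))     # raw, unescaped newline
```

The first three calls should report a match and the last one should not, which is the behaviour the question asks for: newlines are allowed only when escaped.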

6 Answers

132

A string consists of a quote mark

"

followed by zero or more of either an escaped anything

\\.

or a character that is neither a quote nor a backslash

[^"\\]

and finally a terminating quote

"

Put it all together, and you've got

\"(\\.|[^"\\])*\"

The delimiting quotes are escaped because they are Flex meta-characters.
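
To see the three pieces working together, here is a minimal sketch that assembles them with Python's `re` module. This is an assumption made for easy experimentation, not flex itself: in Python the quote is not a metacharacter, so it is written unescaped. The names `QUOTE`, `ESCAPED_ANYTHING`, `ORDINARY_CHAR`, and `first_literal` are hypothetical.

```python
import re

# The three pieces from the answer, written for Python's re module.
QUOTE = r'"'
ESCAPED_ANYTHING = r'\\.'     # a backslash followed by any character except newline
ORDINARY_CHAR = r'[^"\\]'     # anything that is neither a quote nor a backslash

STRING = re.compile(
    QUOTE + '(' + ESCAPED_ANYTHING + '|' + ORDINARY_CHAR + ')*' + QUOTE
)

def first_literal(source):
    """Return the first string literal found in source, or None."""
    m = STRING.search(source)
    return m.group(0) if m else None

print(first_literal('x = "a\\"b" + y'))   # literal containing an escaped quote
print(first_literal('no strings here'))   # nothing to find
```

Because `.` does not match a newline, an escaped newline is not accepted by this version, and `ORDINARY_CHAR` still admits raw newlines.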

Dan O
Jonathan Feinberg
  • 4
    This doesn't handle escaping, unfortunately. So this would incorrectly lex `"\""` – Paul Biggar Aug 01 '10 at 12:55
  • 5
    You must have missed "zero or more of an escaped anything"? – Jonathan Feinberg Aug 02 '10 at 01:48
  • 6
    There are several problems with this answer. First, it's not a valid flex pattern. The leading and trailing double-quotes need to be escaped because otherwise flex treats them as meta-characters. So the pattern should be (perhaps) \"(\\.|[^"])*\" . Second, that pattern still doesn't work. For example, it gets this input wrong: "\\\\" . Third, it doesn't meet the original question's requirement of disallowing newlines. – rob mayoff Jul 27 '11 at 01:13
  • 1
    As a regex, this is totally correct. Except for the newline thing, which is easily fixed by replacing `.` with `[^\n]` and `[^"]` with `[^"\n]`. It certainly should match `"\\\\"` too, since the repetition will match the quote `"`, then the escaped slash `\\ `, then the next escaped slash `\\ `, then the terminating quote `"`. The pattern certainly works for me outside of the scope of flex. – d11wtq Nov 15 '11 at 10:38
  • 7
    It doesn't matter whether it works outside the scope of flex. The question was about flex. If the lexer produced by flex sees `"\\\\"foo"`, it will match the entire input, instead of just matching the `"\\\\"` part, because the character class doesn't exclude backslashes. – rob mayoff Jan 09 '13 at 04:48
  • 1
    @robmayoff is correct. This will incorrectly match all of `"\\"a"` (as: `quote`, `not-quote`, `backslash-dot-anything`, `not-quote`, `quote`). The regex should say `[^"\\]`, not `[^"]`. – Cornstalks Apr 19 '14 at 17:36
  • There's actually one more subtlety here since . won't match a \n. So the final flex pattern needed with escaping is \"(\\.|\\\n|[^"\\])*\" – Jeff Johnson May 21 '15 at 18:03
  • 1
    lol upvoted not b/c it's in flex, but I've been looking for a regex that matches a string literal. – BrockLee Jun 19 '17 at 23:01
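
The disagreement in these comments comes down to matching strategy. Flex always takes the longest match, whereas a backtracking engine such as Python's `re` stops at the first one it finds. As a rough illustration (an assumption, since this is Python and not flex), `fullmatch` can be used to ask whether a pattern is able to consume an entire input, which is what flex's longest-match rule would end up doing. The names `LOOSE`, `STRICT`, `WITH_CONTINUATION`, and `can_match_whole` are hypothetical.

```python
import re

LOOSE = re.compile(r'"(\\.|[^"])*"')                      # class still admits backslashes
STRICT = re.compile(r'"(\\.|[^"\\])*"')                   # backslashes only as escapes
WITH_CONTINUATION = re.compile(r'"(\\.|\\\n|[^"\\])*"')   # Jeff Johnson's variant

def can_match_whole(pattern, text):
    """True if pattern is able to consume all of text in one match."""
    return pattern.fullmatch(text) is not None

# A complete literal "\\" followed by the stray characters a"
tricky = r'"\\"a"'
print(can_match_whole(LOOSE, tricky))    # the loose class can swallow everything
print(can_match_whole(STRICT, tricky))   # the strict class cannot

# A literal that continues across a backslash-newline
continued = '"a\\\nb"'
print(can_match_whole(STRICT, continued))
print(can_match_whole(WITH_CONTINUATION, continued))
```

The loose pattern is able to read the lone backslash as an ordinary character and the following `\"` as an escape, so it runs past the real closing quote; excluding the backslash from the character class removes that reading.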
29

For a single line... you can use this:

\"([^\\\"]|\\.)*\"  {/*matches string-literal on a single line*/;}
Pete
9

How about using a start state...

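%{
/* handle_this_dblquotes() is assumed to be defined elsewhere */
int enter_dblquotes = 0;
%}

%x DBLQUOTES
%%

\"  { BEGIN(DBLQUOTES); enter_dblquotes++; }

<DBLQUOTES>(\\.|[^"\\\n])+  { handle_this_dblquotes(yytext); /* text between the quotes */ }

<DBLQUOTES>\"  {
   if (enter_dblquotes) {
       BEGIN(INITIAL); /* revert back to normal */
       enter_dblquotes--;
   }
}
         ...more rules follow...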

Something to that effect. Flex uses %s or %x to declare start states (inclusive and exclusive, respectively). When the scanner sees a quote, it switches to another state and keeps lexing until it reaches the next quote, at which point it reverts to the normal state.
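
The state switching described here can be mimicked by hand. The following is only a sketch in Python, not generated scanner code: the two state names mirror the flex start states, while `scan_strings` and its escape handling are assumptions added for illustration.

```python
INITIAL, DBLQUOTES = "INITIAL", "DBLQUOTES"

def scan_strings(source):
    """Collect the contents of double-quoted strings using two scanner states."""
    state = INITIAL
    current = []    # pieces of the string currently being read
    strings = []    # contents of every completed string
    i = 0
    while i < len(source):
        ch = source[i]
        if state == INITIAL:
            if ch == '"':
                state = DBLQUOTES           # like BEGIN(DBLQUOTES)
                current = []
        else:
            if ch == '\\' and i + 1 < len(source):
                current.append(source[i:i + 2])   # keep the escape pair intact
                i += 1
            elif ch == '"':
                strings.append(''.join(current))
                state = INITIAL             # like BEGIN(INITIAL)
            else:
                current.append(ch)
        i += 1
    return strings

print(scan_strings('a "one" b "t\\"wo" c'))
```

An unterminated string is silently dropped in this sketch; a real scanner would more likely report an error at that point.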

t0mm13b
  • 2
    @Samoz: Not really, it's actually used in languages where string literals are used, it eats up what's between a beginning quote and an end quote, even if there's extra quotes inside it hence the usage of switching states in order to chew up the quotes... – t0mm13b Jun 04 '10 at 23:39
  • 6
    The flex manual contains a full example (in terms of flex usage) of parsing C-style strings: http://flex.sourceforge.net/manual/Start-Conditions.html . Search for "quoted strings" on that page. – rob mayoff Jul 27 '11 at 01:22
3

Here is my code snippet for handling strings in flex; I hope it gives you some ideas.

Using a start condition to handle string literals is more scalable and clearer.

%x SINGLE_STRING

%%

\"                          BEGIN(SINGLE_STRING);
<SINGLE_STRING>{
  \n                        yyerror("the string misses \" to terminate before newline");
  <<EOF>>                   yyerror("the string misses \" to terminate before EOF");
  ([^\\\"\n]|\\.)+          {/* do your work here, e.g. save the text */}
  \"                        BEGIN(INITIAL);
  .                         ;
}
pwxcoo
2

This is what we use in Zolang for single line string literals with embedded templates ${...}

\"(\$\{.*\}|\\.|[^\"\\])*\"

0

A late answer, but it may be useful to the next person who needs it:

\"(([^\"]|\\\")*[^\\])?\"
Floern
david