
I'm trying to write a regular expression that matches string literals in C.

This is what I have so far:

Strings in C are delimited by double quotes (`"`), so the regex has to begin and end with `\"`.

The string may not contain newline characters, so I need `[^\n]` (I think).

The string may also contain double quotes or backslash characters if and only if they're escaped. Therefore `[\\ \"]` (again, I think).

Other than that anything else goes.

Any help is much appreciated; I'm kind of lost on how to start writing this regex.

Brian
knechtsr
  • Are you trying to match all strings without newline characters that may or may not have escaped back slashes and escaped double quotes? – N Brown Sep 19 '17 at 01:58
  • Possible duplicate of [Regular expression for a string literal in flex/lex](https://stackoverflow.com/questions/2039795/regular-expression-for-a-string-literal-in-flex-lex) – Ken Y-N Sep 19 '17 at 01:58
  • Yes that is correct @ N Brown – knechtsr Sep 19 '17 at 02:25
  • @Ken Y-N that post asks a similar question, but none of the answers fully answer it, as they still allow a newline character. – knechtsr Sep 19 '17 at 02:28
  • [This answer](https://stackoverflow.com/a/9260547/1270789) specifically says it is for single lines. – Ken Y-N Sep 19 '17 at 02:35
  • If you used that regex, it would accept the input "\n", when it should throw a lexical error. – knechtsr Sep 19 '17 at 02:37
  • So adapt it and exclude `\n`? Also, do you need to handle trigraphs? – melpomene Sep 19 '17 at 16:39
  • @keny-n: that's what the answer claims but it is wrong, notwithstanding the upvotes. As melpomene says, you just have to add `\n` to the character class of exclusions. – rici Sep 19 '17 at 16:45

1 Answer


A simple flex pattern to recognize string literals (including literals with embedded line continuations):

    ["]([^"\\\n]|\\.|\\\n)*["]

That will allow:

    "string with \
    line continuation"

But not:

    "C doesn't support
     multiline strings"

If you don't want to deal with line continuations, remove the `\\\n` alternative. If you need trigraph support, it gets more irritating.
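To see what the pattern accepts without running flex, here is a rough hand-written C equivalent. It is only a sketch: the function name `match_string_literal` is made up for illustration, it works on a NUL-terminated buffer (so, unlike flex, it cannot match an embedded NUL), and it ignores trigraphs.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of flex: returns the length of the string
 * literal that starts at s, or 0 if no literal matches there.
 * Intended to mirror the pattern ["]([^"\\\n]|\\.|\\\n)*["]
 */
size_t match_string_literal(const char *s)
{
    const char *p = s;

    if (*p != '"')              /* opening ["] */
        return 0;
    p++;

    for (;;) {
        if (*p == '\0' || *p == '\n')
            return 0;           /* end of input or raw newline: no match */
        if (*p == '"')          /* closing ["] */
            return (size_t)(p + 1 - s);
        if (*p == '\\') {
            p++;                /* skip the backslash */
            if (*p == '\0')
                return 0;       /* backslash at end of input */
            p++;                /* \\. and \\\n: consume the escaped character */
            continue;
        }
        p++;                    /* [^"\\\n]: any ordinary character */
    }
}
```

With this sketch, the line-continuation example above yields the full length of the literal, while the multiline example yields 0. Dropping the `\\\n` alternative would correspond to also returning 0 when the character after the backslash is a newline.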

Although that recognizes strings, it doesn't attempt to make sense of them. Normally, a C lexer will want to process the backslash sequences in a string, so that `"\"\n"` is converted to the two characters `"` and NL (0x22 0x0A). You might, at some point, want to take a look at, for example, "Optimizing flex string literal parsing" (although that will need to be adapted if you are programming in C).
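As a rough illustration of that conversion step, the following C sketch decodes an already-matched literal. The names `simple_escape` and `decode_string_literal` are hypothetical, only a handful of single-character escapes are handled (octal, hex, and universal character names are not), and the caller is assumed to supply an output buffer of at least `len` bytes.

```c
#include <stddef.h>

/* Hypothetical helper: maps the character after a backslash to the byte it
 * denotes. Anything not listed (including " \ and ') maps to itself. */
static char simple_escape(char c)
{
    switch (c) {
    case 'n': return '\n';
    case 't': return '\t';
    case 'r': return '\r';
    case 'a': return '\a';
    case 'b': return '\b';
    case 'f': return '\f';
    case 'v': return '\v';
    default:  return c;
    }
}

/*
 * Hypothetical helper: decodes the literal lit[0..len-1], which is assumed to
 * include its surrounding double quotes, into out. Returns the number of
 * bytes written; out is not NUL-terminated.
 */
size_t decode_string_literal(const char *lit, size_t len, char *out)
{
    size_t n = 0;

    /* Start after the opening quote and stop before the closing one. */
    for (size_t i = 1; i + 1 < len; i++) {
        char c = lit[i];
        if (c == '\\' && i + 2 < len) {
            c = lit[++i];       /* look at the escaped character */
            if (c == '\n')
                continue;       /* line continuation: emit nothing */
            c = simple_escape(c);
        }
        out[n++] = c;
    }
    return n;
}
```

Called on the six-character literal `"\"\n"`, this writes the two bytes 0x22 0x0A. In a flex action, `yytext` and `yyleng` would be the natural arguments for `lit` and `len`.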

Flex patterns are documented in the flex manual. It might also be worthwhile reading a good reference on regular expressions, such as John Levine's excellent book on Flex and Bison.

rici