Possible Duplicate:
How to match a quoted string with escaped quotes in it?

I'm building a parser and need a method that matches a quoted string. The string starts and ends with a ". Everything up to the next " that is not escaped should be matched. A " is escaped when an odd number of backslashes precedes it (e.g. \" or \\\").

Some examples. The part before => is the input, and the part after it is what the method should extract:

"Hello World" => "Hello World"
"Hello" World => "Hello"
"Hello \"World" => "Hello \"World"
"Hello \\" World => "Hello \\"

I guess in most programming languages a backslash has to be escaped to get an actual backslash into a string literal, so two backslashes are needed in the source for one real backslash in the string. The examples above ignore this.
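To make the escaping point concrete, here is a small Ruby sketch of how the third example could be written as a literal. The variable names `single` and `double` are only illustrative and are not part of the original question.

```ruby
# Single-quoted literal: only \\ and \' are escapes, so \" stays as a
# backslash followed by a quote.
single = '"Hello \"World"'

# Double-quoted literal: every quote and backslash inside must be escaped.
double = "\"Hello \\\"World\""

puts single            # prints: "Hello \"World"
puts single == double  # prints: true
puts single.length     # prints: 15
```

Both literals produce the same 15-character string containing exactly one real backslash.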

I came up with this regular expression (I'm using Ruby):

/
  "
  (?:
    (?:\\{2})* # an even amount of backslashes
    \\ # followed by a single backslash: odd amount of backslashes
    "
    |
    [^"]
  )*
  "
/x

However, it doesn't work correctly with the third example string, or with any string that has a backslash escaping a ". I noticed that when I remove the * in the third-to-last line of the pattern, escaping the " works, but then example 4 is no longer handled correctly.

I spent a long time trying to fix this regex, but I couldn't figure out how. I know the question might be a little overwhelming, so tell me if you need more information!

js-coder
  • your 4th example is wrong..it is returning "Hello \\" – Anirudha Nov 03 '12 at 19:30
  • @Fake.It.Til.U.Make.It What do you mean it's wrong? "Hello \\" should be returned. – js-coder Nov 03 '12 at 19:44
  • @Bergi I spent some time looking for a similar question, but I didn't find that one. – js-coder Nov 03 '12 at 19:45
  • @dotweb: Did it help? I'm not sure, there were many that dealt with escaped quotes in regex - yet I didn't find the high-voted catch-all master-solution one… – Bergi Nov 03 '12 at 19:49
  • For the record, building a one-pass parser based on regular expressions will cause you more work than you expect. Most parsers use two passes, one to identify lexically useful components (like digits, letters, and symbols like double-quotation marks) and a second pass to recognize that a double-quote followed by some number of characters followed by a double-quote should be considered a "string" (irrespective of character escapement.) – Rob Raisch Nov 03 '12 at 21:13
  • @RobRaisch It's my first parser, so I'm probably doing a lot of stuff different than it's usually done. :) Can you point me to some good literature? I couldn't find a lot of good stuff about building parsers. – js-coder Nov 03 '12 at 21:24
  • You might check out http://stackoverflow.com/questions/2842809/lexers-vs-parsers as well as http://treetop.rubyforge.org/ which is an excellent parser generator in Ruby. – Rob Raisch Nov 03 '12 at 21:26

1 Answer

Try this:

"(\\[\\"]|[^\\"])*"

A Rubular demo: http://rubular.com/r/Ql9RQ4pex6

A quick breakdown:

"            # a quote
(            # start group 1
  \\[\\"]    #   an escaped quote or backslash
  |          #   OR
  [^\\"]     #   any char except a quote or backslash
)*           # end group 1 and repeat it zero or more times
"            # a quote
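As a rough check, the pattern can be run against the four examples from the question. This is only a sketch: the constant `QUOTED`, the helper `extract_quoted`, the `\A` anchor, and the non-capturing group are assumptions added for illustration and are not part of the pattern above.

```ruby
# The answer's pattern, anchored at the start of the input (assumption) and
# using a non-capturing group (assumption) since the group is never read.
QUOTED = /\A"(?:\\[\\"]|[^\\"])*"/

# Hypothetical helper: returns the leading quoted string, or nil if the
# input does not start with one.
def extract_quoted(input)
  input[QUOTED]
end

# Single-quoted literals: \" stays as backslash + quote, \\ is one backslash.
examples = {
  '"Hello World"'      => '"Hello World"',
  '"Hello" World'      => '"Hello"',
  '"Hello \"World"'    => '"Hello \"World"',
  '"Hello \\\\" World' => '"Hello \\\\"'
}

examples.each do |input, expected|
  result = extract_quoted(input)
  puts "#{input} => #{result} (#{result == expected ? 'ok' : 'MISMATCH'})"
end
```

Because a backslash can only be consumed together with the character after it, backslashes are always eaten in pairs or as part of an escaped quote, which is what keeps example 4 from running past the closing quote.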
Bart Kiers