5

How would one allow for single and double quotes, as well as unicode characters inside of a PEG.js grammar definition? To be more specific, I'd like to be able to capture strings that can contain both single and double quotes (will most likely have to be \ escaped) and all unicode characters.

At the moment I have something like the following:

_ name:$(PROP_ASCII+) CHAR_SQ val:$(PROP_ASCII_INNER*) CHAR_SQ

which would capture something like

key'value'

PROP_ASCII* is defined as

PROP_ASCII = [!-&(-<>-~]
PROP_ASCII_INNER = [ -&(-~]

So this works fine and dandy if value contains standard ASCII characters and does not contain single quotes... But I'd like to support what I've described above, so something like this would become possible:

key'somé\'value\'☂'

Thoughts?

Speedy

2 Answers

16

This example should get you going. It supports both single and double quotes which can also be escaped within the string.

Try it in the online editor.

Value
  = '"' chars:DoubleStringCharacter* '"' { return chars.join(''); }
  / "'" chars:SingleStringCharacter* "'" { return chars.join(''); }

DoubleStringCharacter
  = !('"' / "\\") char:. { return char; }
  / "\\" sequence:EscapeSequence { return sequence; }

SingleStringCharacter
  = !("'" / "\\") char:. { return char; }
  / "\\" sequence:EscapeSequence { return sequence; }

EscapeSequence
  = "'"
  / '"'
  / "\\"
  / "b"  { return "\b";   }
  / "f"  { return "\f";   }
  / "n"  { return "\n";   }
  / "r"  { return "\r";   }
  / "t"  { return "\t";   }
  / "v"  { return "\x0B"; }
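To make the behaviour of the rules above easier to check without generating a parser, here is a minimal hand-written sketch in plain JavaScript that mirrors the `Value` and `EscapeSequence` rules. It is not PEG.js output, and `parseQuoted` and `ESCAPES` are hypothetical names chosen for illustration. Like `.` in the grammar, it walks the input one UTF-16 code unit at a time, so characters above `\uFFFF` pass through as two halves that rejoin in the result.

```javascript
// Hand-written mirror of the Value / EscapeSequence rules (illustrative only).
// ESCAPES and parseQuoted are hypothetical names, not part of PEG.js.
const ESCAPES = {
  "'": "'", '"': '"', "\\": "\\",
  b: "\b", f: "\f", n: "\n", r: "\r", t: "\t", v: "\x0B",
};

function parseQuoted(input) {
  const quote = input[0];
  if (quote !== '"' && quote !== "'") {
    throw new SyntaxError("expected a quote at position 0");
  }
  let out = "";
  let i = 1;
  // Consume characters until the matching (unescaped) closing quote.
  while (i < input.length && input[i] !== quote) {
    if (input[i] === "\\") {
      const seq = input[i + 1];
      if (!Object.prototype.hasOwnProperty.call(ESCAPES, seq)) {
        throw new SyntaxError("unknown escape at position " + i);
      }
      out += ESCAPES[seq];
      i += 2;
    } else {
      out += input[i];
      i += 1;
    }
  }
  // The closing quote must be the last character of the input.
  if (i !== input.length - 1) {
    throw new SyntaxError("unterminated string or trailing input");
  }
  return out;
}

console.log(parseQuoted("'somé\\'value\\'☂'"));
```

The call at the end feeds in the value from the question, `'somé\'value\'☂'`, and yields the text with the backslashes removed.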
dan
  • 1
    Since the check `!('"' / "\\")` already sits inside the (Double/Single)StringCharacter definitions, simply changing the Unicode range `[\x20-\x21\x23-\x5B\x5D-\uFFFF]` to `.` works. This way it also ingests Unicode characters `> \uFFFF`. – Speedy Dec 02 '15 at 00:15
  • @r3oath - ah! Well that makes sense :) Updated. Thanks. – dan Dec 02 '15 at 11:01
  • I want to upvote this 100 times. Thank you very much! – TommyMason Jul 03 '17 at 08:23
1

I found a solution in the PEG.js example JSON grammar. Unicode strings with escape sequences can be defined like this:

string "string"
  = quotation_mark chars:char* quotation_mark { return chars.join(""); }

char
  = unescaped
  / escape
    sequence:(
        '"'
      / "\\"
      / "/"
      / "b" { return "\b"; }
      / "f" { return "\f"; }
      / "n" { return "\n"; }
      / "r" { return "\r"; }
      / "t" { return "\t"; }
      / "u" digits:$(HEXDIG HEXDIG HEXDIG HEXDIG) {
          return String.fromCharCode(parseInt(digits, 16));
        }
    )
    { return sequence; }

escape         = "\\"
quotation_mark = '"'
unescaped      = [\x20-\x21\x23-\x5B\x5D-\u10FFFF]
Speedy
  • 1
    The unescaped rule needs a bit of work. Try typing `"\u10ffff"` into a JS console => `"ჿff"`. The `\u` escape takes exactly four hex digits, so that rule supports characters up to `\u10FF`, plus the literals `F` and `F`. E.g., it doesn't support the Unicode snowman ☃ (`\u2603`) since `\u2603` > `\u10FF`. The largest value a single `\uXXXX` escape can represent in JS is the 16-bit `\uFFFF`. – dan Dec 01 '15 at 11:36
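The point in this comment can be reproduced in plain JavaScript. The sketch below uses a JavaScript regular expression (without the `u` flag) as a stand-in for the PEG.js character class, on the assumption that both read `\u` followed by exactly four hex digits; `broken` and `fixed` are illustrative names.

```javascript
// "\u10ffff" is the character U+10FF followed by two literal "f" characters.
const literal = "\u10ffff";
console.log(literal.length); // 3

// Stand-ins for the character class in the `unescaped` rule.
// Without the `u` flag, \u10FFFF parses as \u10FF, then "F", then "F".
const broken = /^[\x20-\x21\x23-\x5B\x5D-\u10FFFF]$/;
const fixed = /^[\x20-\x21\x23-\x5B\x5D-\uFFFF]$/;

console.log(broken.test("\u2603")); // false: the snowman is above \u10FF
console.log(fixed.test("\u2603"));  // true
console.log(fixed.test('"'));       // false: the quote is still excluded
```

Characters above `\uFFFF` are stored as two UTF-16 code units, so a class ending at `\uFFFF` matches each half separately rather than the character as a whole.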