
Background:

I am implementing a language similar to Ruby, called Sapphire, as a way to try out some ideas I have on concurrency in programming languages. I am trying to copy Ruby's double-quoted strings with embedded code, which I find very useful as a programmer.

Question:

How do any of the Ruby interpreters turn a double-quoted string with embedded code into an AST?

e.g.:

puts "The value of foo is #{@foo}."

puts "this is an example of unmatched braces in code: #{ foo.go('}') }"

Details:

The problem I have is how to decide which } closes the code block. Code blocks can have other braces within them and with a little effort they can be unmatched. The lexer can find the beginning of a code block in a string, but without the aid of the parser, it cannot know for sure which character is the end of that block.

It looks like Ruby's parse.y file does both the lexing and parsing steps, but reading that thing is a nightmare: it is 11628 lines long, with no comments and lots of abbreviations.

John F. Miller
  • Also, fwiw, string literals have to solve a similar tricky problem: `%Q{{hi}} #=> "{hi}"` – Alex Wayne Jan 30 '14 at 23:13
  • Alex, your example will give an error if you type it into IRB. Ruby will end the string on the first close brace. That is why you are allowed to choose your symbol in this construction, i.e. `%Q|{hi}|`. The symbol that opens the construct is the one that closes it. The exceptions are `{[(<`, which are closed by `>)]}`. – John F. Miller Jan 31 '14 at 00:31
  • Michael: heredocs are easy. The terminating symbol *cannot* occur at the start of a line except where it closes the heredoc. If you find the terminating symbol it MUST be the end. – John F. Miller Jan 31 '14 at 00:34
  • In the pry (another repl), it doesn't show any error and it shouldn't. I made this gist, as I cannot explain it easily: https://gist.github.com/nedzadarek/8744476 Tested on the pry on Ruby version 1.9.3 and 2.0 – Darek Nędza Jan 31 '14 at 22:26
  • As for heredocs, closing symbol must appear **alone** in the new line without any other characters. Other occurrences of **symbol** are valid. So considering `ST` is that **symbol** `ST ` (see the space after `ST`) won't close the heredoc. 2. There is special syntax: `var1=<<-ST` that lets you put spaces before terminating string (` ST` is valid). 3. There can be multiple occurrences of heredocs, and `a,b=< – Darek Nędza Jan 31 '14 at 22:42
  • FYI: there are already two Ruby-inspired languages named Sapphire. – Jörg W Mittag Feb 03 '14 at 15:13

5 Answers


True, Yacc files can be a bit daunting to read at first and parse.y is not the best file to start with. Have you looked at the various string production rules? Do you have any specific questions?

As for the actual parsing, it's not uncommon for lexers to also parse numeric literals and strings; see e.g. the accepted answer to a similar question here on SO. If you approach things this way, it's not too hard to see how to go about it. Hitting #{ inside a string basically starts a new parsing context that gets parsed as an expression again. This means that the first } in your example can't be the terminating one for the interpolation, since it's part of a literal string within the expression. Once you reach the end of the expression (keep in mind expression separators like ;), the next } is the one you need.
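To make the idea concrete, here is a rough sketch of how a hand-written scanner might locate the closing } once it has seen #{. It is a simplification and not what parse.y does: instead of parsing a full expression it only counts braces and skips over quoted string literals (recursing into nested interpolations). The method names `interpolation_end` and `skip_string` are made up for this example, and comments, regexps, heredocs and %-literals inside the embedded code are deliberately ignored.

```ruby
# Hypothetical helper: `start` is the index just past "#{".
# Returns the index of the "}" that closes the interpolation.
# Assumption of this sketch: only braces and quoted strings matter.
def interpolation_end(src, start)
  depth = 0
  i = start
  while i < src.length
    case src[i]
    when "'", '"'
      i = skip_string(src, i) # jump to the closing quote of the literal
    when "{"
      depth += 1              # a nested block or hash literal
    when "}"
      return i if depth.zero? # this brace closes the interpolation
      depth -= 1
    end
    i += 1
  end
  raise ArgumentError, "unterminated interpolation"
end

# Hypothetical helper: `open` is the index of an opening quote.
# Returns the index of the matching closing quote.
def skip_string(src, open)
  quote = src[open]
  i = open + 1
  while i < src.length
    if src[i] == "\\"
      i += 2                  # skip the escaped character
    elsif quote == '"' && src[i, 2] == '#{'
      # Nested interpolation inside a nested double-quoted string.
      i = interpolation_end(src, i + 2) + 1
    elsif src[i] == quote
      return i
    else
      i += 1
    end
  end
  raise ArgumentError, "unterminated string literal"
end

src = 'puts "unmatched: #{ foo.go(\'}\') }"'
start = src.index('#{') + 2
stop = interpolation_end(src, start)
puts src[start...stop] # the embedded code without the surrounding #{ }
```

The } inside the single-quoted literal is skipped because the scanner is in "string" mode at that point, which is the same reason the real parser does not treat it as the terminator.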

Michael Kohl

This is not a complete answer, but I leave it in hopes that it might be useful either to me or one who follows me.

Matz gives a pretty detailed rundown of the yylex() function of parse.y in chapter 11 of his book. It does not directly mention strings, but it does describe how the lexer uses lex_state to resolve several locally ambiguous constructs in Ruby.

A reproduction of an English translation of this chapter can be found here.
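If reading parse.y directly is too painful, one way to see what its lexer does with an interpolated string is Ripper, which ships with CRuby's standard library. The snippet below is only an exploration aid; it assumes a reasonably recent CRuby, and the sample source string is my own.

```ruby
require "ripper"

# Sample input: the "unmatched brace" case from the question.
source = 'puts "braces: #{ foo.go(\'}\') }"'

# Each entry returned by Ripper.lex starts with [position, event, text].
Ripper.lex(source).each do |tok|
  event = tok[1]
  text = tok[2]
  puts format("%-22s %p", event, text)
end
```

In the output, the interpolation is bracketed by on_embexpr_beg and on_embexpr_end events, while the } inside the quoted argument shows up as ordinary string content.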

John F. Miller

Dart also supports expressions interpolated into strings like Ruby, and I've skimmed a few parsers for it. I believe what they do is define separate tokens for a string literal preceding interpolation and a string literal at the end. So if you tokenize:

"before ${the + expression} after"

You would get tokens like:

STRING_START "before "
IDENTIFIER   the
PLUS
IDENTIFIER   expression
STRING       " after"

Then in your parser, it's a pretty straightforward process of handling STRING_START to parse the interpolated expression(s) following it.
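A minimal sketch of such a tokenizer, written in Ruby and using Ruby's #{ } syntax rather than Dart's ${ }, might look like the following. The token names come from the list above; the method names (`tokenize`, `lex_string`, `lex_expression`) and the tiny expression language (identifiers, +, braces, nested strings) are assumptions made for the example, not anything taken from a real Dart or Ruby implementation.

```ruby
require "strscan"

# Lex one double-quoted literal; the scanner sits on the opening quote.
def lex_string(ss)
  ss.skip(/"/) or raise ArgumentError, "expected opening quote"
  tokens = []
  text = String.new
  until ss.eos?
    if ss.skip(/\#\{/)
      tokens << [:STRING_START, text] # literal text before an interpolation
      text = String.new
      lex_expression(ss, tokens)      # switch to "code" mode until the closing }
    elsif ss.skip(/"/)
      tokens << [:STRING, text]       # final segment of the literal
      return tokens
    elsif ss.skip(/\\/)
      text << ss.getch.to_s           # keep the escaped character as-is
    else
      text << ss.getch
    end
  end
  raise ArgumentError, "unterminated string literal"
end

# Lex embedded code up to (and including) the } that closes it.
def lex_expression(ss, tokens)
  depth = 0
  until ss.eos?
    if ss.skip(/\s+/)
      next
    elsif ss.check(/"/)
      tokens.concat(lex_string(ss))   # nested string, possibly interpolated
    elsif (name = ss.scan(/[A-Za-z_@]\w*/))
      tokens << [:IDENTIFIER, name]
    elsif ss.skip(/\+/)
      tokens << [:PLUS]
    elsif ss.skip(/\{/)
      depth += 1
      tokens << [:LBRACE]
    elsif ss.skip(/\}/)
      return if depth.zero?           # end of the interpolation
      depth -= 1
      tokens << [:RBRACE]
    else
      raise ArgumentError, "unexpected character #{ss.peek(1).inspect}"
    end
  end
  raise ArgumentError, "unterminated interpolation"
end

def tokenize(src)
  lex_string(StringScanner.new(src))
end

pp tokenize('"before #{the + expression} after"')
```

For that input the sketch produces the same five tokens as the list above. Because the lexer switches modes at #{ and back at the matching }, a } inside a nested string never reaches the brace-matching branch.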

munificent

Please bear in mind that they don't have to (create an AST at compile time).

Ruby strings can be assembled at runtime and will interpolate correctly. Therefore all the parsing and evaluation machinery has to be available at runtime. Any work done at compile time in that sense could be considered an optimisation.

So why does this matter? Because there are very effective stack-based techniques for parsing and evaluating expressions that do not create or decorate an AST. The string is read (parsed) from left to right, and as embedded tokens are encountered they are either evaluated or pushed on a stack, or cause stack contents to be popped and evaluated.

This is a simple technique to implement provided the expressions are relatively simple. If you really want the full power of the language inside every string, then you need the full compiler at runtime. Not everyone does.

Disclosure: I wrote a commercial language product that does exactly this.
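As an illustration of the stack-based approach described in this answer, here is a small sketch that interpolates a template at runtime without building an AST. It is not the commercial product mentioned above; the names `evaluate`, `apply_top` and `interpolate` are invented for the example, and it assumes the embedded expressions contain only integers, variable names, + - * / and parentheses (so no } can appear inside them). There is no error handling for malformed input.

```ruby
PRECEDENCE = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }.freeze

# Pop one operator and two operands, push the result back.
def apply_top(values, operators)
  op = operators.pop
  right = values.pop
  left = values.pop
  values.push(left.public_send(op, right))
end

# Evaluate an infix expression left to right using two stacks.
def evaluate(expr, vars)
  values = []
  operators = []
  expr.scan(/\d+|[A-Za-z_]\w*|[-+*\/()]/) do |tok|
    case tok
    when /\A\d/
      values.push(tok.to_i)
    when /\A[A-Za-z_]/
      values.push(vars.fetch(tok)) # look the variable up immediately
    when "("
      operators.push(tok)
    when ")"
      apply_top(values, operators) until operators.last == "("
      operators.pop                # discard the "("
    else
      # Reduce pending operators of equal or higher precedence first.
      while operators.any? && operators.last != "(" &&
            PRECEDENCE[operators.last] >= PRECEDENCE[tok]
        apply_top(values, operators)
      end
      operators.push(tok)
    end
  end
  apply_top(values, operators) while operators.any?
  values.pop
end

# Replace every #{...} in the template with the value of its expression.
def interpolate(template, vars)
  template.gsub(/\#\{(.*?)\}/) { evaluate(Regexp.last_match(1), vars).to_s }
end

puts interpolate('total: #{price * (qty + 1)} units', "price" => 3, "qty" => 4)
# prints "total: 15 units"
```

The trade-off is the one noted above: this stays simple only while the expression language is small. Once the embedded code can be arbitrary Ruby, finding the closing } already requires the real parser.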

david.pfx

Our Ruby parser (see my bio) treats Ruby "strings" as complex objects having lots of substructures, including string start and end tokens, bare string literal fragments, lots of funny punctuation sequences representing the various regexp operators, and of course, recursively, most of Ruby itself for expressions nested inside such strings.

This is accomplished by allowing the lexer to detect and generate such string fragments in a (for Ruby, many) special lexing modes. The parser has a (sub)grammar that defines valid sequences of tokens. And that kind of parsing solves OP's original problem; the parser knows whether a curly brace matches other curly braces from the regexp content, and/or if the regexp has been completely assembled and the curly brace is a matching block end.

Yes, it builds an AST of the Ruby code, and of the regexps.

The purpose of all this is to allow us to build analyzers and transformers of Ruby code. See https://softwarerecs.stackexchange.com/q/11779/101

Ira Baxter