37

Is it a lexer's job to parse numbers and strings?

This may or may not sound dumb, given the fact that I'm asking whether a lexer should parse input. However, I'm not sure whether that's in fact the lexer's job or the parser's job, because in order to lex properly, the lexer needs to parse the string/number in the first place, so it would seem like code would be duplicated if the parser does this.

Is it indeed the lexer's job? Or should the lexer simply break up a string like `123.456` into the strings `123`, `.`, and `456` and let the parser figure out the rest? Doing this wouldn't be so straightforward with strings...

user541686
  • is this for something like Lex / YACC? – Grady Player Jun 12 '11 at 04:33
  • @Grady: Kinda sorta... not really. xD I'm trying to make my own lexer (and hopefully parser) by hand, and I can't figure out which one should do what for parsing numbers and strings. I'm not using any external tools like Lex/YACC, so in that regard, no. – user541686 Jun 12 '11 at 04:34
  • Intriguing, but still unclear. Do you mean is it the lexer's job to identify the *type* of number for the parser, beyond identifying the token? In which case, as you say, you have kind of a circular reference (how can it know the format to parse before knowing the type). I think this is why C# (for instance) allows suffixes on floating-point numbers (F and D) to disambiguate floats from doubles for the lexer. My guess is, when the lexer's done, there should be no question about the tokens, but there may be some question about the types. Literal formats have to be crafted to allow this. – harpo Jun 12 '11 at 05:15
  • 1
    @harpo: No, I'm not referring to the types. (Or am I? I think I got confused by what you meant by "type".) I'm just referring to the fact that, for a lexer to be able to say "here is a string literal", it would need to process all the escape codes and everything. But if all it does is just returning a portion of the input, then the parser will have to do the exact same thing again -- which is duplication, isn't it? – user541686 Jun 12 '11 at 05:20
  • You mean it has to "process" escape sequences by recognizing that they do not terminate the string, but are part of it; and by handing this "literally" to the parser, it then becomes the parser's job to encode them. Alternatively, the lexer could perform the encoding and hand the parser a "ready" string. The latter feels more right to me, for strings, but for numbers that would seem to imply that the lexer is *converting* input to a ready number for the parser, which does seem to be going out of scope. Just thinking out loud here. – harpo Jun 12 '11 at 05:52
  • And who said you need a lexer at all? The distinction between lexical analysis (tokenization) and parsing is pretty arbitrary and is a property of the tools used, not the parsing per se. – Gene Bushuyev Jun 12 '11 at 07:15

3 Answers

34

The simple answer is "Yes".

In the abstract, you don't need lexers at all. You could simply write a grammar that used individual characters as tokens (and in fact that's exactly what SGLR parsers do, but that's a story for another day).

You need lexers because parsers built using characters as primitive elements aren't as efficient as parsers that break the input stream into "tokens", where tokens are the primitive elements of the language you are parsing (whitespace, keywords, identifiers, numbers, operators, strings, comments, ...). [If you don't care about efficiency you can skip the rest of this answer and go read about SGLR parsers].

Good lexers typically take sets of regular expressions representing the language elements, and compile them into an efficient finite state machine that can segment the input stream into such language elements quickly. (If you don't want to use a lexer generator, for simple languages you can code the FSA yourself). Such compiled FSAs execute only a few tens of machine instructions per input character (get character from input buffer, switch on character to new state, decide if token is complete, if not do it again), and can thus be extremely fast.
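As a rough sketch of what hand-coding such a recognizer might look like (the token names and function here are invented purely for illustration), a scanner for integer and decimal literals could be as small as:

#include <ctype.h>
#include <stddef.h>

/* Illustrative token kinds; a real lexer has many more. */
typedef enum { TOK_ERROR, TOK_INT, TOK_FLOAT } TokenKind;

/* Scan one numeric token starting at s, walking by hand through the
   states "digits" and "digits '.' digits". Sets *len to the number of
   characters consumed. */
TokenKind scan_number(const char *s, size_t *len)
{
    size_t i = 0;
    TokenKind kind = TOK_ERROR;

    while (isdigit((unsigned char)s[i])) { i++; kind = TOK_INT; }

    if (kind == TOK_INT && s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;                                    /* consume the '.' */
        while (isdigit((unsigned char)s[i])) i++;
        kind = TOK_FLOAT;
    }

    *len = i;
    return kind;
}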

The output of such lexers is typically a code representing the language element (or nothing for whitespace if the parser would ignore it anyway) and some position information (starts in file foo, line 17 column 3) to enable error reporting.
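Concretely, such a token can be as small a record as the following (the field names are just for illustration):

/* Illustrative token record: a kind code plus enough position
   information for later phases to report errors precisely. */
typedef struct {
    int         kind;     /* e.g. TOK_INT, TOK_IDENT, ... */
    const char *file;     /* "foo" */
    int         line;     /* 17 */
    int         column;   /* 3 */
} Token;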

One can stop there and have useful lexers. It is often useful to add a conversion step that converts the character string into the equivalent native machine value for that token, either as the characters are collected or when the token is complete, because at that point one still has knowledge of the specific characters involved in the token. This is used to convert numbers (of varying radixes) in the target language to their native binary equivalent, to convert literal strings containing escape sequences into the actual characters making up the string, and even to take identifier names and look them up in a hash table so that identical identifiers are easily determined. The parser typically isn't interested in these converted values, but steps beyond parsing (semantic analysis, checking for optimizations, code generation) need the converted values anyway, so you might as well convert them as you discover them. (You could delay this conversion until the binary value was needed, but in practice you almost always need the value, so delaying the conversion doesn't buy very much).
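For example, the on-the-fly conversions described above might look roughly like this sketch (only decimal integers and a couple of escape sequences are handled; a real lexer covers the full set and checks for overflow):

#include <stddef.h>

/* Sketch: build the native value of a decimal integer literal as
   the digits are consumed, instead of re-parsing the text later. */
long convert_integer(const char *s, size_t len)
{
    long value = 0;
    for (size_t i = 0; i < len; i++)
        value = value * 10 + (s[i] - '0');      /* accumulate as we go */
    return value;
}

/* Sketch: convert the body of a string literal such as a\nb into the
   actual characters a, newline, b. Writes into out and returns the
   converted length. */
size_t convert_string_body(const char *s, size_t len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == '\\' && i + 1 < len) {
            i++;
            out[n++] = (s[i] == 'n') ? '\n'
                     : (s[i] == 't') ? '\t'
                     : s[i];                    /* \" and \\ pass through */
        } else {
            out[n++] = s[i];
        }
    }
    return n;
}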

Ira Baxter
  • Thanks for the detailed answer, it's really helpful, +1. One question, though: You mention that `One can stop there and have useful lexers`. If I stop there, that means that the parser will have to process the literals *again* (to figure out escape codes, parse numbers, etc.), am I correct? Hence it will involve duplication of code/work? – user541686 Jun 12 '11 at 05:24
  • The parser won't need to; it is only interested in the abstract tokens and not their actual values. Steps *after* the parser will be interested, and there the literals will have to be processed again as you said. It is always more expensive to rediscover and use information you already knew, than to take advantage of that information when you knew it. – Ira Baxter Jun 12 '11 at 05:32
  • The ideal lexer will carry out the value conversions as it sees characters that *might* be part of a lexeme. If you hand code your lexer, you will easily see this opportunity (as you process a digit, it is easy to multiply an accumulator by 10 and add the value). In practice, this is a bit messier, but if you pay the price of the mess, you get extremely fast lexers that convert as they go. – Ira Baxter Jun 12 '11 at 05:33
  • Your last paragraph doesn't play at all with cross-compilers. Conversion to target formats is a function of code generation, not input scanning. I was taught this at length by Frank de Remer in 1979. – user207421 Jan 29 '16 at 21:06
  • @EJP: So the code generator read the string characters comprising the input language's notation, and *then* decided what to do? How is that an advantage? How did your cross compilers generate target-language compatible literals, without knowing what the source language literals *mean*? Even if you thought it was the cross-compiler's decision as to how to represent the converted values, it is just fine having a canonical machine value so this sure doesn't hurt. Our experience with building many cross-compilers is that this technique is pretty convenient. – Ira Baxter Jan 29 '16 at 21:27
  • Consider a cross-compiler on a big-endian machine targeting a little-endian machine. 'Canonical machine value' != 'target representation', and doesn't appear in your paragraph. What you actually said was 'native machine value'. Anyway take it up with Frank ;-) – user207421 Jan 29 '16 at 21:32
  • @EJP: Considering a cross compiler: presumably it knows the target is little endian, and it can know or be told its native machine value is big or little endian. With that knowledge, it is straightforward to generate the proper little-endian value. I don't have to take it up with Frank. – Ira Baxter Jan 29 '16 at 21:37
  • Hang on. I didn't rule out a canonical representation. I ruled out conversion in the front end to target representation. The front end is or should be machine-independent, so it doesn't 'know' it is cross-compiling at all. Neither does the back end, for that matter. – user207421 Jan 29 '16 at 21:41
  • @EJP: Nowhere did I say the front end should convert to *target* representation. Nor did I say *machine independent*. What I advocate is that the front end lexer produce canonical native values for the machine on which the translator is running (goes without saying: without loss of precision). This lets the translator manipulate that value with maximal convenience, using the native instruction set into which the translator program has presumably been compiled. I advocate the target code generator, if that is what you are doing, convert that native form to the desired output format. – Ira Baxter Jan 29 '16 at 21:44
  • May I direct you to the second sentence of your last paragraph of your answer, where you said exactly that, including the word 'native'. I agree entirely about the canonical representation, and *I* haven't said otherwise. But it's not what your answer above says. The word 'canonical' doesn't appear here until after my comment. – user207421 Jan 29 '16 at 21:53
  • So I added the word "canonical" to help clarify "native", not to replace or change it. This approach is consistent with your objection to front-end conversion to target representation. So I'm puzzled as to why you brought it up. – Ira Baxter Jan 30 '16 at 00:00
0

I assume you want to treat "123.456" as a whole value, in which case you will pass it wholesale to the parser, unless you need to encode it somehow, like:

struct DecimalRep {
    double mantissa;
    double exponent;
};

but I guess it all depends on what the parser expects.

Grady Player
  • So it should actually pass the data to the parser in a more suitable internal format, rather than just as a regular string? – user541686 Jun 12 '11 at 04:37
  • sounds like it is up to you ;) I would treat it as a single token. – Grady Player Jun 12 '11 at 04:38
  • also if they are going to be separate and independent, you wouldn't want to make a special data type for it. – Grady Player Jun 12 '11 at 04:39
  • @Mehrdad: You want the literal value to be converted to a representation which is easy for the rest of the compiler to manipulate. If you have a floating point literal, you should convert it (if no loss of precision) to an IEEE format floating number; then if the compiler has to add two literals (e.g., constant folding) it can do that using the native IEEE floating point built into the CPU. Likewise for string/integer values. (If you leave all of these things as raw character sequences, you may save [probably not] some time by not doing the conversions, and then you get messy compiler guts.) – Ira Baxter Oct 16 '17 at 20:43
  • @IraBaxter: Yeah, 6 years later and I understand these pretty well :) thanks! – user541686 Oct 16 '17 at 21:30
  • @IraBaxter: ... assuming that your compiler machine and target machine use the same FP format. If you're cross-compiling on x86 with target being S/360 or VAX, for instance, you wouldn't want to use the native IEEE FP format at all, and you might need to convert characters to EBCDIC. :-) I think this is the sort of problem EJP was trying to get at in the other series of comments. – torek Oct 16 '17 at 22:24
  • @torek: I take your (and EJP's) point. However, to the extent that doing the math on the compiling machine loses no interesting precision wrt target (yes, IEEE vs mainframe hex float math might be an issue, but 8 bit string concatenation not so much), you want it convenient for the compiler. When it emits the object code, it can do a last-moment conversion to the target data type. Agreed, if it matters, then you'll have to use a canonical infinitely accurate package, or the native target format (if the latter, doing compile-time math will be annoying at best). – Ira Baxter Oct 16 '17 at 23:14
  • @torek: Surely storing your floating point literal as a text string will make doing compile time math *really* awkward :-} – Ira Baxter Oct 16 '17 at 23:18
  • Yes. Some (poor) cross compilers don't bother to optimize at all, though, they just make the target machine do everything at run-time so that they do not have to bother implementing cross-system math libraries. It's even worse when they do compile-time math in the host-native format, though: users (especially library writers like me) sometimes need to get the target system to produce, say, its special "fp format violation" trap which occurs only for particular bit patterns that mean something else in IEEE float... – torek Oct 16 '17 at 23:26
0

A lexer essentially identifies TOKENs from the input. In this case, the lexer will possibly "match" the number as a float-number TOKEN. The parser essentially processes tokens and does syntax analysis.
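For example (the token names here are made up for illustration), given the input `price = 123.456;`, the lexer might emit the stream IDENT(price), EQUALS, FLOAT_LITERAL(123.456), SEMICOLON, and the parser then checks that this sequence forms a valid assignment.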

Sai
  • @Sai: Yeah, the tokenization makes sense (hence why I added the tag to my question :P) but I'm not sure what the difference between "matching" and "processing" is here. E.g. so what should `"Hello\n World"` be turned into by the lexer? – user541686 Jun 12 '11 at 04:47
  • You can think of a lexer as a pre-processor that helps make it easy for the parser to parse. Typically "Hello" will be identified by the lexer as a WORD, \n as a NEWLINE, and World as a WORD – Sai Jun 12 '11 at 04:51
  • @Mehrdad -- sorry, didn't read it correctly. I didn't realize that it was a string literal (even though it was clear as day). Sorry about that. This should typically be interpreted as a string literal. – Sai Jun 12 '11 at 05:10
  • 2
    To be clear, *traditional* string literals are extracted as a single token by most lexers. "Hello\n World" would produce a single STRINGLITERAL entry whose binary value was the character codes for H, e, l, ... \n, W, o, ... d. (more...) – Ira Baxter Jun 13 '11 at 08:36
  • 2
    ... Depending on the language, the string literals might not be traditional, and thus not lexed traditionally. Our PHP front end for DMS lexes double-quoted "string literals" as a series of string-literal fragments of different types. This is because such string literals in PHP are really implicit expressions that concatenate a bunch of string values, some of which really are literal string fragments, some of which are interpolated variables or other operators allowed by PHP in the "middle" of a literal string. This lexer is still doing the right thing: picking up language elements. – Ira Baxter Jun 13 '11 at 08:38
  • @Mehrdad +1 for a Great Question and @Ira +1 for a great explanation! – Sai Jun 13 '11 at 16:00