2

Say a language defines adjacency of two mathematical unicode alphanumerical symbols as an operator. Say, +1 means %adj + 1, where %adj stands for whatever operator adjacency defines, multiplication in this case. I was wondering, can any existing lexical analysis tool handle this?

D. Huang
  • 445
  • 1
  • 4
  • 5
  • Are you assuming that all variable names consist of a single letter or do you want the lexer to disambiguate based on which variables are defined (like if `xy` is defined as a variable, `xy` should be a single token, otherwise it should be two)? – sepp2k Jan 09 '17 at 13:31
  • This issue was treated in significant depth in a [paper](http://www.stroustrup.com/whitespace98.pdf) that may have been forgotten by now (with the proverbial grain of salt.) – shinobi Jan 09 '17 at 19:32
  • @shinobi: That paper is still a fun read after almost two decades, but casual readers should be warned that its publication date was [April 1](http://www.stroustrup.com/whitespace.html), 1998 – rici Jan 10 '17 at 05:07
  • @sepp2k: here the and are unicode alphanumerical symbols, different than regular x and y. xy is treated as a single token, but should stand for multiplication – D. Huang Jan 10 '17 at 11:35

3 Answers3

1

Invisible operators cannot be recognized with lexical analysis, for reasons which should be more or less obvious. You can only deduce the presence of an invisible operator by analyzing the syntactic context, which is the role of a parser.

Of course, most lexical analysis tools allow arbitrary code to be executed for each recognized token, so nothing stops you from building a state machine, or even a complete parser, into the lexical scanner. That is rarely good design.

If your language is unambiguous, then there is no problem handling adjacency in your grammar. But some care must be taken. For example, you would rarely want x-4 to be parsed as a multiplication of x and -4, but a naive grammar which included, eg.,

expr -> term | expr '-' term
term -> factor | term factor | term '*' factor
factor -> ID | NUMBER | '(' expr ')' | '-' factor

would include that ambiguity. To resolve it, you need to disallow the adjacency production with a second operand starting with a unary operator:

expr -> term | expr '-' term
term -> factor | term item | term '*' factor
factor -> item | '-' factor
item -> ID | NUMBER | '(' expr ')'

Note the difference between term -> term '*' factor, which allows x * - y, and term -> term base, which does not allow x - y (expr -> expr '-' term recognizes x - y as a subtraction).

For examples of context-free grammars which allow adjacency as an operator, see, for example, Awk, in which adjacency represents string concatenation, and Haskell, in which it represents function application.


Since this question comes up from time to time, there are a number of relevant answers already on SO. Here are a few:

Community
  • 1
  • 1
rici
  • 234,347
  • 28
  • 237
  • 341
0

Here is one example using pyparsing in Python:

import pyparsing as pp

integer = pp.pyparsing_common.integer()
variable = pp.oneOf(list("abcdefghijklmnopqrstuvwxyz"))

base_operand = integer | variable

implied_multiplication = pp.Empty().addParseAction(lambda: "*")
expr = pp.infixNotation(base_operand,
                [
                    ("**", 2, pp.opAssoc.LEFT),
                    (implied_multiplication, 2, pp.opAssoc.LEFT),
                    (pp.oneOf("+ -"), 1, pp.opAssoc.RIGHT),
                    (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
                    (pp.oneOf("+ -"), 2, pp.opAssoc.LEFT),
                ])

This assumes that variables are just single characters. There is also some fudging of precedence of operations to make adjacency, exponentiation, and leading signs work. The parse action added to the implied_multiplication expression is there to show the insertion of the multiplication operator.

Here is some test output:

tests = """
    x-4
    ax**2 + bx +c
    ax**2-bx+c
    mx+b
    """
expr.runTests(tests, fullDump=False)

prints:

x-4
[['x', '-', 4]]

ax**2 + bx +c
[[['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']]

ax**2-bx+c
[[['a', '*', ['x', '**', 2]], '-', ['b', '*', 'x'], '+', 'c']]

mx+b
[[['m', '*', 'x'], '+', 'b']]
PaulMcG
  • 62,419
  • 16
  • 94
  • 130
0

Unless tokens have a fixed length, you must separate adjacent tokens of the same type with some other token or whitespace. The Gosu programming language incorporates adjacency to implement "binding expressions" which support units:

var length = 10m  // 10 meters

var work = 5kg * 9.8 m/s/s * 10m
print( work )  // prints 490 J

var investment = 5000 EUR + 10000 USD

var date = 1966-May-5 2:35:53:909 PM PST
Scott
  • 949
  • 9
  • 14