
Abstract Problem:

I'd like to implement a self-reference / pointer in Elm.

Specific Problem:

I'm writing a toy LISP interpreter in Elm inspired by mal.

I'm attempting to implement something like letrec to support recursive and mutually-recursive bindings (the "self reference" and "pointers" I'm mentioning above).

Here's some example code:

(letrec
  ([count (lambda (items)
            (if (empty? items)
                0
                (+ 1 (count (cdr items)))
            )
          )
  ])
  (count (quote (1 2 3)))
)
;=>3

Note how the body of the lambda refers to the binding count. In other words, the function needs a reference to itself.

Deeper Background:

When a lambda is defined, we need to create a function closure which consists of three components:

  1. The function body (the expression to be evaluated when the function is called).
  2. A list of function arguments (local variables that will be bound upon calling).
  3. A closure (the values of all non-local variables that may be referenced within the body of the function).

From the Wikipedia article:

Closures are typically implemented with [...] a representation of the function's lexical environment (i.e., the set of available variables) at the time when the closure was created. The referencing environment binds the non-local names to the corresponding variables in the lexical environment at the time the closure is created, additionally extending their lifetime to at least as long as the lifetime of the closure itself. When the closure is entered at a later time, possibly with a different lexical environment, the function is executed with its non-local variables referring to the ones captured by the closure, not the current environment.

Based on the above lisp code, in creating the lambda, we create a closure whose count variable must be bound to the lambda, thereby creating an infinite/circular/self-reference. This problem gets further complicated by mutually-recursive definitions which must be supported by letrec as well.

Elm, being a pure functional language, does not support imperative modification of state. Therefore, I believe that it is impossible to represent self-referencing values in Elm. Can you provide some guidance on alternatives to implementing letrec in Elm?

Research and Attempts

Mal in Elm

Jos van Bakel has already implemented mal in Elm. See his notes here and the environment implementation here. He's gone to great lengths to manually build a pointer system with its own internal GC mechanism. While this works, it seems like a massive amount of struggle. I'm craving a pure functional implementation.

Mal in Haskell

The mal implementation in Haskell (see code here) uses Data.IORef to emulate pointers. This also seems like a hack to me.

Y-Combinator/Fixed Points

It seems possible that the Y-Combinator can be used to implement these self-references. There seems to be a Y* combinator that works for mutually recursive functions as well. It seems logical to me that there must also exist a Z* combinator (equivalent to Y* but supporting the eager evaluation model of Elm). Should I transform all of my letrec instances so that each binding is wrapped in a Z*?

The Y-Combinator is new to me and my intuitive mind simply does not understand it so I'm not sure if the above solution will work.
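For reference, a plain (single-function) Z combinator can be written in Haskell without any recursive bindings. This is an editorial sketch, not part of the original question: the names `Rec` and `z` are made up here, and the recursive newtype is needed because direct self-application `f f` is ill-typed in Haskell.

```haskell
-- Sketch of a Z combinator (the eager-friendly Y) with no recursive let.
-- Self-application is ill-typed in plain Haskell, so we wrap the
-- function's argument type in a recursive newtype. The eta-expansion
-- (\a -> ...) is what makes Z safe under eager evaluation.
newtype Rec a b = Rec { unRec :: Rec a b -> a -> b }

z :: ((a -> b) -> a -> b) -> a -> b
z f = g (Rec g)
  where g x = f (\a -> unRec x x a)

-- A recursive function defined without ever naming itself:
factorial :: Integer -> Integer
factorial = z (\self n -> if n <= 1 then 1 else n * self (n - 1))

main :: IO ()
main = print (factorial 5)  -- prints 120
```

A Z* for mutual recursion follows the same pattern with tuples of functions, though whether that is simpler than the alternatives is exactly the question here.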

Conclusion

Thank you very much for reading! I have been unable to sleep well for days as I struggle with this problem.

Thank You!

-Advait

advait
  • What would you like for `(letrec ([x y] [y 2]) x)` ? – luqui Feb 29 '20 at 01:04
  • @luqui because bindings resolve in order this should result in a reference error. – advait Feb 29 '20 at 02:46
  • On the other hand `(letrec ([x (lambda () y)] [y 2]) (x))` should resolve to `2` as the "return expression" evaluates *after* the bindings are set. This is how you might end up with a mutually recursive definition. – advait Feb 29 '20 at 06:34
  • ah! You see, there *is* state in these semantics. `y` denotes different things depending on when it is evaluated. – luqui Feb 29 '20 at 16:10
  • @luqui I agree that "y denotes different things depending on when it is evaluated", however, I'm not sure why this point is significant. Surely there is some notion of state in the semantics of binding creation. Can you help me better understand what you're trying to get at? – advait Feb 29 '20 at 21:05
  • in Scheme, letrec's bindings are created [not in any order](https://stackoverflow.com/a/15006018/849891). In Haskell, `let {x = y ; y = 2} in x` is perfectly well defined. – Will Ness Feb 29 '20 at 21:06
  • 1
    you build your own interpreter, you can do anything in it. When you build the letrec's environment frame, put naked lambdas into it. when interpreting, you *know* those lambdas are defined *in* that frame. if you need to return a closure from inside letrec, you return a pairing of the naked lambda and the letrec's frame. there' no problem here. :) again: if all you have in value slots are lambdas, there's no problem. if you have values there, referring to the letrec's variables, then, just forbid this! it is forbidden in Scheme, too. (or results in errors, whatever). – Will Ness Feb 29 '20 at 21:22
  • @advait, well you kept objecting to the stateful solutions as "hacks", but your semantics are stateful. While there are different levels of hackiness (as a haskell boy, I agree `IORef` for this is too much), if you are implementing this in a stateless language, you cannot avoid using *some* model of state, because your language is stateful. – luqui Feb 29 '20 at 23:17
  • @luqui thank you for the clarification. You are correct that we'll need some model of state. However, I believe there are better and worse ways of implementing state in an FP language. I stand by my assertion that the aforementioned approaches are hacks - especially when juxtaposed to K. A. Buhr's beautiful solution below which either represents the state explicitly or uses the Reader monad - both elegant ways of managing the state. – advait Mar 02 '20 at 05:38

3 Answers


A binding construct in which the expressions can see the bindings doesn't require any exotic self-reference mechanisms.

How it works is that an environment is created for the variables, and then the values are assigned to them. The initializing expressions are evaluated in the environment in which those variables are already visible. Thus if those expressions happen to be lambda expressions, then they capture that environment, and that's how the functions can refer to each other.

An interpreter does this by extending the environment with the new variables, and then using the extended environment for evaluating the assignments. Similarly, a compiler extends the compile-time lexical environment, and then compiles the assignments under that environment, so the running code will store values into the correct frame locations. If you have working lexical closures, the correct behavior of functions being able to mutually recurse just pops out.

Note that if the assignments are performed in left-to-right order, and one of the lambdas happens to be dispatched during initialization, and then happens to make a forward call to one of the lambdas through a not-yet-assigned variable, that will be a problem; e.g.

(letrec
  ([alpha (lambda () (omega))]
   [beta (alpha)] ;; problem: alpha calls omega, not yet stored in variable.
   [omega (lambda () 'done)])
  ...)

Note that in the R7RS Scheme Report (pp. 16-17), letrec is in fact documented as working like this. All the variables are bound, and then they are assigned the values. If the evaluation of an init expression refers to the same variable that is being initialized, or to later variables not yet initialized, R7RS says that it is an error. The document also specifies a restriction regarding the use of continuations captured in the initializing expressions.
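To make the bind-then-assign scheme concrete, here is a minimal sketch (an editorial addition, not from this answer) of such an interpreter in Haskell, using IORefs to stand in for the mutable environment slots a strict implementation would use. The mini-language and all names here are made up for illustration.

```haskell
-- Sketch of bind-then-assign letrec. Environment slots are IORefs:
-- first bind every name to an uninitialized slot, then evaluate each
-- init expression in the *extended* environment and fill its slot.
import Data.IORef
import Data.Maybe (fromJust)

data Expr = Lit Int | Var String | Lam String Expr | App Expr Expr
          | Sub Expr Expr | If0 Expr Expr Expr
          | LetRec [(String, Expr)] Expr

data Value = IntV Int | Clo String Expr Env
type Env = [(String, IORef Value)]

eval :: Env -> Expr -> IO Value
eval _   (Lit n)     = pure (IntV n)
eval env (Var v)     = readIORef (fromJust (lookup v env))
eval env (Lam x b)   = pure (Clo x b env)
eval env (App f a)   = do
  Clo x b cenv <- eval env f
  av  <- eval env a
  ref <- newIORef av
  eval ((x, ref) : cenv) b
eval env (Sub a b)   = do
  IntV x <- eval env a
  IntV y <- eval env b
  pure (IntV (x - y))
eval env (If0 c t e) = do
  IntV n <- eval env c
  eval env (if n == 0 then t else e)
eval env (LetRec binds body) = do
  -- 1. bind: create an uninitialized slot for every name
  refs <- mapM (\(n, _) -> (,) n <$> newIORef (error ("unassigned: " ++ n)))
               binds
  let env' = refs ++ env
  -- 2. assign: evaluate inits in env', so lambdas close over env'
  mapM_ (\((_, ref), (_, ex)) -> eval env' ex >>= writeIORef ref)
        (zip refs binds)
  eval env' body

-- A self-recursive function built via LetRec: f counts n down to 0,
-- recursing through its own binding, then returns 42.
main :: IO ()
main = do
  IntV r <- eval [] (LetRec
    [("f", Lam "n" (If0 (Var "n") (Lit 42)
                        (App (Var "f") (Sub (Var "n") (Lit 1)))))]
    (App (Var "f") (Lit 3)))
  print r  -- prints 42
```

Here `error` marks the not-yet-assigned state, matching the R7RS stance that touching such a variable during initialization is an error.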

Kaz
  • 1
    I think the problem is in the part where "and then the values are assigned to them". I believe that pure functional languages like Elm/Haskell do not support such assignment. It may be possible to use StateT or some other mechanism but this is where I'm looking for guidance. Thank you for your answer! – advait Feb 28 '20 at 23:39

In Haskell, this is fairly straightforward thanks to lazy evaluation. Because Elm is strict, to use the technique below, you would need to introduce laziness explicitly, which would be more or less equivalent to adding a pointer indirection layer of the sort you mentioned in your question.
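To illustrate what that explicit laziness might look like, here is a sketch (an editorial addition, written in Haskell syntax but using only constructs Elm also has): each cons field holds a unit-thunk rather than a value, so a cell never has to contain itself directly, and the recursion hides behind a lambda.

```haskell
-- Sketch: the explicit-thunk value representation a strict language
-- would need. Each field is a function from () that is forced on use.
data Value = AtomV Int | ConsV (() -> Value) (() -> Value)

car :: Value -> Value
car (ConsV ca _cd) = ca ()

cdr :: Value -> Value
cdr (ConsV _ca cd) = cd ()

-- A self-referencing list of ones: the tail thunk closes over the
-- binding's own name, which strict languages like Elm permit because
-- the self-reference sits inside a lambda.
ones :: Value
ones = ConsV (\() -> AtomV 1) (\() -> ones)

main :: IO ()
main = case car (cdr (cdr ones)) of
         AtomV n -> print n  -- prints 1
         _       -> pure ()
```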

Anyway, the Haskell answer might be useful to someone, so here goes...

Fundamentally, a self-referencing Haskell value is easily constructed by introducing a recursive binding, such as:

let mylist = [1,2] ++ mylist in mylist

The same principle can be used in writing an interpreter to construct self-referencing values.

Given the following simple S-expression language for constructing potentially recursive / self-referencing data structures with integer atoms:

data Expr = Atom Int | Var String | Cons Expr Expr | LetRec [String] [Expr] Expr

we can write an interpreter to evaluate it to the following type, which doesn't use IORefs or ad hoc pointers or anything weird like that:

data Value = AtomV Int | ConsV Value Value deriving (Show)

One such interpreter is:

type Context = [(String,Value)]

interp :: Context -> Expr -> Value
interp _ (Atom x) = AtomV x
interp ctx (Var v) = fromJust (lookup v ctx)
interp ctx (Cons ca cd) = ConsV (interp ctx ca) (interp ctx cd)
interp ctx (LetRec vs es e)
  = let ctx' = zip vs (map (interp ctx') es) ++ ctx
    in  interp ctx' e

This is effectively a computation in a reader monad, but I've written it explicitly because a Reader version would require using the MonadFix instance either explicitly or via the RecursiveDo syntax and so would obscure the details.

The key bit of code is the case for LetRec. Note that a new context is constructed by introducing a set of potentially mutually recursive bindings. Because evaluation is lazy, the values themselves can be computed with `map (interp ctx') es` using the newly created ctx' of which they are a part, tying the recursive knot.

We can use our interpreter to create a self-referencing value like so:

car :: Value -> Value
car (ConsV ca _cd) = ca

cdr :: Value -> Value
cdr (ConsV _ca cd) = cd

main = do
  let v = interp [] $ LetRec ["ones"] [Cons (Atom 1) (Var "ones")] (Var "ones")

  print $ car $ v
  print $ car . cdr $ v
  print $ car . cdr . cdr $ v
  print $ car . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr $ v

Here's the full code, also showing an alternative interp' using the Reader monad with recursive-do notation:

{-# LANGUAGE RecursiveDo #-}
{-# OPTIONS_GHC -Wall #-}

module SelfRef where

import Control.Monad.Reader
import Data.Maybe

data Expr = Atom Int | Var String | Cons Expr Expr | LetRec [String] [Expr] Expr
data Value = AtomV Int | ConsV Value Value deriving (Show)

type Context = [(String,Value)]

interp :: Context -> Expr -> Value
interp _ (Atom x) = AtomV x
interp ctx (Var v) = fromJust (lookup v ctx)
interp ctx (Cons ca cd) = ConsV (interp ctx ca) (interp ctx cd)
interp ctx (LetRec vs es e)
  = let ctx' = zip vs (map (interp ctx') es) ++ ctx
    in  interp ctx' e

interp' :: Expr -> Reader Context Value
interp' (Atom x) = pure $ AtomV x
interp' (Var v) = asks (fromJust . lookup v)
interp' (Cons ca cd) = ConsV <$> interp' ca <*> interp' cd
interp' (LetRec vs es e)
  = mdo let go = local (zip vs vals ++)
        vals <- go $ traverse interp' es
        go $ interp' e

car :: Value -> Value
car (ConsV ca _cd) = ca

cdr :: Value -> Value
cdr (ConsV _ca cd) = cd

main = do
  let u = interp [] $ LetRec ["ones"] [Cons (Atom 1) (Var "ones")] (Var "ones")
  let v = runReader (interp' $ LetRec ["ones"] [Cons (Atom 1) (Var "ones")] (Var "ones")) []

  print $ car . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr $ u
  print $ car . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr . cdr $ v
K. A. Buhr
  • Thank you very much for your guidance! Your examples have been extremely elucidating. I'm struck by the sheer beauty of mutually recursive / lazy definitions. I've gone ahead and implemented your recommendation successfully! If anyone is interested, the code is available [here](https://github.com/advait/spillem/blob/adb9f1e24c4cde1380b384123ea9495678147d85/src/Eval.elm?ts=2#L84). – advait Mar 02 '20 at 05:28

The U combinator

I am late to the party here, but I got interested and spent some time working out how to do this in a Lisp-family language, specifically Racket, and thought perhaps other people might be interested.

I suspect that there is lots of information about this out there, but it's seriously hard to search for anything which looks like '*-combinator' now (even now I am starting a set of companies called 'Integration by parts' and so on).

You can, as you say, do this with the Y combinator, but I didn't want to do that because Y is something I find I can understand for a few hours at a time and then I have to work it all out again. But it turns out that you can use something much simpler: the U combinator. It seems to be even harder to search for this than Y, but here is a quote about it:

In the theory of programming languages, the U combinator, U, is the mathematical function that applies its argument to its argument; that is U(f) = f(f), or equivalently, U = λ f . f(f).

Self-application permits the simulation of recursion in the λ-calculus, which means that the U combinator enables universal computation. (The U combinator is actually more primitive than the more well-known fixed-point Y combinator.)

The expression U(U), read U of U, is the smallest non-terminating program, [...].

(Text from here, which unfortunately is not a site all about the U combinator other than this quote.)

Prerequisites

All of the following code samples are in Racket. The macros are certainly Racket-specific. To make the macros work you will need syntax-parse via:

(require (for-syntax syntax/parse))

However note that my use of syntax-parse is naïve in the extreme: I'm really just an unfrozen CL caveman pretending to understand Racket's macro system.

Also note I have not ruthlessly turned everything into λ: there are lets in this code, use of multiple values including let-values, (define (f ...) ...) and so on.

Two versions of U

The first version of U is the obvious one:

(define (U f)
  (f f))

But this will run into some problems with an applicative-order language, which Racket is by default. To avoid that we can make the assumption that (f f) is going to be a function, and wrap that form in another function to delay its evaluation until it's needed: this is the standard trick that you have to do for Y in an applicative-order language as well. I'm only going to use the applicative-order U when I have to, so I'll give it a different name:

(define (U/ao f)
  (λ args (apply (f f) args)))

Note also that I'm allowing more than one argument rather than doing the pure-λ-calculus thing.

Using U to construct a recursive function

To do this we do a similar trick that you do with Y: write a function which, if given a function as argument which deals with the recursive cases, will return a recursive function. And obviously I'll use the Fibonacci function as the canonical recursive function.

So, consider this thing:

(define fibber
  (λ (f)
    (λ (n)
      (if (<= n 2)
          1
          (+ ((U f) (- n 1))
             ((U f) (- n 2)))))))

This is a function which, given another function f such that (U f) computes smaller Fibonacci numbers, will return a function which computes the Fibonacci number for n.

In other words, U of this function is the Fibonacci function!

And we can test this:

> (define fibonacci (U fibber))
> (fibonacci 10)
55

So that's very nice.
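As an aside (an editorial sketch, not part of the original answer): the same construction can be reproduced in typed Haskell by hiding the self-application inside a recursive newtype; laziness means the applicative-order U/ao workaround isn't needed there.

```haskell
-- Sketch: U in Haskell. Plain (f f) is ill-typed, so the argument
-- type is wrapped in a recursive newtype.
newtype Rec a = Rec { unRec :: Rec a -> a }

u :: Rec a -> a
u x = unRec x x

-- The analogue of fibber: given (a wrapping of) itself, it reaches
-- the smaller Fibonacci computations via u.
fibber :: Rec (Integer -> Integer)
fibber = Rec (\f n -> if n <= 2 then 1 else u f (n - 1) + u f (n - 2))

fibonacci :: Integer -> Integer
fibonacci = u fibber

main :: IO ()
main = print (fibonacci 10)  -- prints 55
```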

Wrapping U in a macro

So, to hide all this the first thing to do is to remove the explicit calls to U in the recursion. We can lift them out of the inner function completely:

(define fibber/broken
  (λ (f)
    (let ([fib (U f)])
      (λ (n)
        (if (<= n 2)
            1
            (+ (fib (- n 1))
               (fib (- n 2))))))))

Don't try to compute U of this: it will recurse endlessly because (U fibber/broken) -> (fibber/broken fibber/broken) and this involves computing (U fibber/broken), and we're doomed.

Instead we can use U/ao:

(define fibber
  (λ (f)
    (let ([fib (U/ao f)])
      (λ (n)
        (if (<= n 2)
            1
            (+ (fib (- n 1))
               (fib (- n 2))))))))

And this is all fine: ((U fibber) 10) is 55 (and terminates!).

And this is really all you need to be able to write the macro:

(define-syntax (with-recursive-binding stx)
  (syntax-parse stx
    [(_ (name:id value:expr) form ...+)
     #'(let ([name (U (λ (f)
                        (let ([name (U/ao f)])
                          value)))])
         form ...)]))

And this works fine:

(with-recursive-binding (fib (λ (n)
                               (if (<= n 2)
                                   1
                                   (+ (fib (- n 1))
                                      (fib (- n 2))))))
  (fib 10))

A caveat on bindings

One fairly obvious thing here is that there are two bindings constructed by this macro: the outer one, and an inner one of the same name. And these are not bound to the same function in the sense of eq?:

(with-recursive-binding (ts (λ (it)
                              (eq? ts it)))
  (ts ts))

is #f. This matters only in a language where bindings can be mutated: a language with assignment in other words. Both the outer and inner bindings, unless they have been mutated, are to functions which are identical as functions: they compute the same values for all values of their arguments. In fact, it's hard to see what purpose eq? would serve in a language without assignment.

This caveat will apply below as well.

Two versions of U for many functions

The obvious generalization of U, U*, to many functions is that U*(f1, ..., fn) is the tuple (f1(f1, ..., fn), f2(f1, ..., fn), ...). And a nice way of expressing that in Racket is to use multiple values:

(define (U* . fs)
  (apply values (map (λ (f)
                       (apply f fs))
                     fs)))

And we need the applicative-order one as well:

(define (U*/ao . fs)
  (apply values (map (λ (f)
                       (λ args (apply (apply f fs) args)))
                     fs)))

Note that U* is a true generalization of U: (U f) and (U* f) are the same.

Using U* to construct mutually-recursive functions

I'll work with a trivial pair of functions:

  • an object is a numeric tree if it is a cons and its car and cdr are numeric objects;
  • an object is a numeric object if it is a number, or if it is a numeric tree.

So we can define 'maker' functions (with an '-er' convention: a function which makes an x is an xer, or, if x has hyphens in it, an x-er) which will make suitable functions:

(define numeric-tree-er
  (λ (nter noer)
    (λ (o)
      (let-values ([(nt? no?) (U* nter noer)])
        (and (cons? o)
             (no? (car o))
             (no? (cdr o)))))))

(define numeric-object-er
  (λ (nter noer)
    (λ (o)
      (let-values ([(nt? no?) (U* nter noer)])
        (cond
          [(number? o) #t]
          [(cons? o) (nt? o)]
          [else #f])))))

Note that for both of these I've raised the call to U* a little, simply to make the call to the appropriate value of U* less opaque.

And this works:

(define-values (numeric-tree? numeric-object?)
  (U* numeric-tree-er numeric-object-er))

And now:

> (numeric-tree? 1)
#f
> (numeric-object? 1)
#t
> (numeric-tree? '(1 . 2))
#t
> (numeric-tree? '(1 2 . (3 4)))
#f

Wrapping U* in a macro

The same problem as previously happens when we raise the inner call to U* with the same result: we need to use U*/ao. In addition the macro becomes significantly more hairy and I'm moderately surprised that I got it right so easily. It's not conceptually hard: it's just not obvious to me that the pattern-matching works.

(define-syntax (with-recursive-bindings stx)
  (syntax-parse stx
    [(_ ((name:id value:expr) ...) form ...+)
     #:fail-when (check-duplicate-identifier (syntax->list #'(name ...)))
     "duplicate variable name"
     (with-syntax ([(argname ...) (generate-temporaries #'(name ...))])
       #'(let-values
             ([(name ...) (U* (λ (argname ...)
                                (let-values ([(name ...)
                                              (U*/ao argname ...)])
                                  value)) ...)])
           form ...))]))

And now, in a shower of sparks, we can write:

(with-recursive-bindings ((numeric-tree?
                           (λ (o)
                             (and (cons? o)
                                  (numeric-object? (car o))
                                  (numeric-object? (cdr o)))))
                          (numeric-object?
                           (λ (o)
                             (cond [(number? o) #t]
                                   [(cons? o) (numeric-tree? o)]
                                   [else #f]))))
  (numeric-tree? '(1 2 3 (4 (5 . 6) . 7) . 8)))

and get #t.


As I said, I am sure there are well-known better ways to do this, but I thought this was interesting enough not to lose.
