139

Church numbers are an encoding of natural numbers as functions.

(\ f x → (f x))             -- church number 1
(\ f x → (f (f (f x))))     -- church number 3
(\ f x → (f (f (f (f x))))) -- church number 4

Neatly, you can exponentiate two church numbers just by applying one to the other. That is, if you apply 4 to 2, you get the church number 16, or 2^4. Obviously, that is utterly impractical. Church numbers need a linear amount of memory and are really, really slow. Computing something like 10^10 - which GHCi answers correctly in an instant - would take ages, and the result couldn't fit in the memory of your computer anyway.
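
For instance, here is a minimal Haskell sketch of exponentiation by application (my own illustration, not part of the original experiment):

two, four :: (a -> a) -> a -> a
two  f = f . f               -- church number 2
four f = f . f . f . f       -- church number 4

-- Applying 4 to 2 builds 2^4; counting with (+1) from 0 recovers 16.
main :: IO ()
main = print (four two (+1) (0 :: Int))  -- prints 16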

I've been experimenting with optimal λ evaluators lately. During my tests, I accidentally typed the following into my optimal λ-calculator:

10 ^ 10 % 13

It was supposed to be multiplication, not exponentiation. Before I could move my fingers to abort the forever-running program in despair, it answered my request:

3
{ iterations: 11523, applications: 5748, used_memory: 27729 }

real    0m0.104s
user    0m0.086s
sys     0m0.019s

With my "bug alert" flashing, I went to Google and verified: 10^10 % 13 == 3 indeed. But the λ-calculator wasn't supposed to find that result - it can barely store 10^10. I started stressing it, for science. It instantly answered 20^20 % 13 == 3, 50^50 % 13 == 4, 60^60 % 3 == 0. I had to use external tools to verify those results, since Haskell itself couldn't compute them due to Int overflow (it can, of course, if you use Integer instead of Int!). Pushing it to its limits, this was the answer to 200^200 % 31:

5
{ iterations: 10351327, applications: 5175644, used_memory: 23754870 }

real    0m4.025s
user    0m3.686s
sys     0m0.341s
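
For reference, these results can be double-checked in GHCi with arbitrary-precision Integers, which don't overflow:

> (10^10) `mod` 13 :: Integer
3
> (200^200) `mod` 31 :: Integer
5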

If we had one copy of the universe for each atom in the universe, and a computer for each atom in all of those universes, we still couldn't store the church number 200^200. This prompted me to question whether my Mac was really that powerful. Maybe the optimal evaluator was able to skip the unnecessary branches and arrive right at the answer, in the same fashion Haskell does with lazy evaluation. To test this hypothesis, I compiled the λ program to Haskell:

-- A term is either a function or a (strict) machine-level number.
data Term = F !(Term -> Term) | N !Double
instance Show Term where {
    show (N x) = "(N "++(if fromIntegral (floor x) == x then show (floor x) else show x)++")";
    show (F _) = "(λ...)"}
-- Left-associative application operator for Terms.
infixl 0 #
(F f) # x = f x
-- churchNum builds the church number for a machine number n.
churchNum = F(\(N n)->F(\f->F(\x->if n<=0 then x else (f#(churchNum#(N(n-1))#f#x)))))
expMod    = (F(\v0->(F(\v1->(F(\v2->((((((churchNum # v2) # (F(\v3->(F(\v4->(v3 # (F(\v5->((v4 # (F(\v6->(F(\v7->(v6 # ((v5 # v6) # v7))))))) # v5))))))))) # (F(\v3->(v3 # (F(\v4->(F(\v5->v5)))))))) # (F(\v3->((((churchNum # v1) # (churchNum # v0)) # ((((churchNum # v2) # (F(\v4->(F(\v5->(F(\v6->(v4 # (F(\v7->((v5 # v7) # v6))))))))))) # (F(\v4->v4))) # (F(\v4->(F(\v5->(v5 # v4))))))) # ((((churchNum # v2) # (F(\v4->(F(\v5->v4))))) # (F(\v4->v4))) # (F(\v4->v4))))))) # (F(\v3->(((F(\(N x)->F(\(N y)->N(x+y)))) # v3) # (N 1))))) # (N 0))))))))
main = print $ (expMod # N 5 # N 5 # N 4)

This correctly outputs 1 (5 ^ 5 % 4) - but throw anything above 10^10 at it and it will get stuck, eliminating the lazy-evaluation hypothesis.

The optimal evaluator I used is a 160-line, unoptimized JavaScript program that didn't include any sort of exponential-modulus math - and the lambda-calculus modulus function I used was equally simple:

(λab.(b(λcd.(c(λe.(d(λfg.(f(efg)))e))))(λc.(c(λde.e)))(λc.(a(b(λdef.(d(λg.(egf))))(λd.d)(λde.(ed)))(b(λde.d)(λd.d)(λd.d))))))

I used no specific modular arithmetic algorithm or formula. So, how is the optimal evaluator able to arrive at the right answers?

MaiaVictor
  • Can you tell us more about the type of optimal evaluation you use? Perhaps a paper citation? Thanks! – Jason Dagit Jul 29 '15 at 18:25
  • I'm using Lamping's abstract algorithm, as explained in the book [The Optimal Implementation of Functional Programming Languages](https://www.google.com.br/?gfe_rd=cr&ei=4hq5VZjHFKqk8weG0aDYAw&gws_rd=ssl#safe=active&q=the+optimal+implementation+of+functional). Notice I'm not using the "oracle" (no croissants/brackets), since that term is EAL-typeable. Also, instead of randomly reducing fans in parallel, I'm sequentially traversing the graph so as not to reduce unreachable nodes, but I'm afraid this isn't in the literature AFAIK... – MaiaVictor Jul 29 '15 at 18:28
  • Okay, in case anyone is curious, I've set up a [GitHub repository](https://github.com/SrVictorMaia/optlam) with the source code for my optimal evaluator. It has many comments, and you can test it by running `node test.js`. Let me know if you have any questions. – MaiaVictor Jul 29 '15 at 19:24
  • Neat find! I do not know enough about optimal evaluation, but I can say that this reminds me of Fermat's Little Theorem / Euler's Theorem. If you are unaware of them, they might be a good starting point. – luqui Jul 29 '15 at 20:18
  • Are you sure the arcane `expMod` combinator does not really exploit some "specific modular arithmetic algorithm or formula"? I have no idea about how that incantation really works. Can you split the algorithm more cleanly into "computing exp" and "computing mod"? – chi Jul 29 '15 at 22:18
  • Or more directly, what happens if you do `expMod(10(10), 1, 13)` so that the `exp` step is clearly separated? – chi Jul 29 '15 at 22:21
  • @chi, `exp.mod` is just `(a b c -> (nat.mod (nat.pow a b) c))`. `nat.pow` is just `(m n -> (n m))` and `nat.mod` is `(n m -> (tuple.rev_enum m (args.select_mod m n)))`. You can consult the dependencies of `nat.mod` on the [Lambda Wiki](http://lambda.wiki/nat.mod); it is all listed there. – MaiaVictor Jul 29 '15 at 22:29
  • @Viclib OK, then. I actually hoped that was the case, but could not see it from the code. – chi Jul 29 '15 at 22:30
  • This is the first time I don't have the slightest clue what the question is all about, but I nevertheless upvote it, and particularly the outstanding first-post answer. – Marco13 Jul 31 '15 at 15:50
  • Victor, are you by any chance currently in Oxford at ICFP? (I could not find another way to contact you) – Łukasz Lew Sep 04 '17 at 23:23
  • @ŁukaszLew no, I'm in Rio, sorry. Why? Feel free to mail me at srvictormaia@gmail.com or add me on fb.com/vhsmaia... – MaiaVictor Sep 05 '17 at 00:42

2 Answers

131

The phenomenon comes from the number of shared beta-reduction steps, which can be dramatically different in Haskell-style lazy evaluation (or in usual call-by-value, which is not that far off in this respect) and in Vuillemin-Lévy-Lamping-Kathail-Asperti-Guerrini (et al.) "optimal" evaluation. This is a general feature, completely independent of the arithmetic formulas you could use in this particular example.

Sharing means having a representation of your lambda-term in which one "node" can describe several similar parts of the actual lambda-term you represent. For instance, you can represent the term

\x. x ((\y.y)a) ((\y.y)a)

using a (directed acyclic) graph in which there is only one occurrence of the subgraph representing (\y.y)a, and two edges targeting that subgraph. In Haskell terms, you have one thunk that you evaluate only once, and two pointers to this thunk.
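
A rough Haskell rendering of that sharing (a sketch of mine, not from the question; Debug.Trace is used only to make the single evaluation observable):

import Debug.Trace (trace)

main :: IO ()
main = do
  -- One thunk for the shared subterm (\y.y)a - here, id 42.
  -- Both uses below point to it, so "evaluating a" prints exactly once.
  let shared = trace "evaluating a" (id 42 :: Int)
  print (shared + shared)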

Haskell-style memoization implements sharing of complete subterms. This level of sharing can be represented by directed acyclic graphs. Optimal sharing does not have this restriction: it can also share "partial" subterms, which may imply cycles in the graph representation.

To see the difference between these two levels of sharing, consider the term

\x. (\z.z) ((\z.z) x)

If your sharing is restricted to complete subterms, as is the case in Haskell, you may have only one occurrence of \z.z, but the two beta-redexes here will be distinct: one is (\z.z) x and the other is (\z.z) ((\z.z) x), and since they are not equal terms they cannot be shared. If sharing of partial subterms is allowed, then it becomes possible to share the partial term (\z.z) [] (that is not just the function \z.z, but "the function \z.z applied to something"), which evaluates in one step to just that something, whatever the argument is. Hence you can have a graph in which only one node represents the two applications of \z.z to two distinct arguments, and in which these two applications can be reduced in just one step. Note that there is a cycle on this node, since the argument of the "first occurrence" is precisely the "second occurrence".

Finally, with optimal sharing you can go from (a graph representing) \x. (\z.z) ((\z.z) x) to (a graph representing) the result \x.x in just one step of beta-reduction (plus some bookkeeping). This is basically what happens in your optimal evaluator (and the graph representation is also what prevents space explosion).

For slightly extended explanations, you can look at the paper Weak Optimality, and the Meaning of Sharing (what you are interested in is the introduction and Section 4.1, and maybe some of the bibliographic pointers at the end).

Coming back to your example, the coding of arithmetic functions working on Church integers is one of the "well-known" mines of examples where optimal evaluators can perform better than mainstream languages (in this sentence, "well-known" actually means that a handful of specialists are aware of these examples). For more such examples, take a look at the paper Safe Operators: Brackets Closed Forever by Asperti and Chroboczek (and by the way, you will find there interesting lambda-terms that are not EAL-typeable; so I encourage you to take a look at oracles, starting with this Asperti/Chroboczek paper).

As you said yourself, this kind of encoding is utterly impractical, but it still represents a nice way of understanding what is going on. And let me conclude with a challenge for further investigation: will you be able to find an example in which optimal evaluation of these supposedly bad encodings is actually on par with traditional evaluation on a reasonable data representation? (As far as I know, this is a real open question.)

Thibaut Balabonski
7

This isn't an answer, but it's a suggestion of where you might start looking.

There's a trivial way to calculate modular exponentiations in little space, specifically by rewriting

(a * x ^ y) % z

as

(((a * x) % z) * x ^ (y - 1)) % z

If an evaluator evaluates like this and keeps the accumulating parameter a in normal form, then you will avoid using too much space. If your evaluator is indeed optimal then presumably it does no more work than this one, so in particular it can't use more space than the time this one takes to evaluate.
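
In Haskell, that rewriting corresponds to something like the following accumulator-passing sketch (my own illustration of the idea, not a claim about what the optimal evaluator actually does):

-- expMod x y z computes (x ^ y) `mod` z while keeping the accumulator
-- reduced modulo z at every step, so intermediate values stay small.
expMod :: Integer -> Integer -> Integer -> Integer
expMod x y z = go 1 y
  where
    go a 0 = a `mod` z
    go a n = go ((a * x) `mod` z) (n - 1)

For example, expMod 200 200 31 evaluates to 5 - matching the optimal evaluator's answer - using linear time but constant space.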

I'm not really sure what an optimal evaluator really is, so I'm afraid I can't make this more rigorous.

Tom Ellis
  • Interesting. A follow-up question would be: does that mean that any church-number function that I throw at the evaluator will compute as fast as it possibly could, even if I don't specifically use those "shortcut formulas"? – MaiaVictor Jul 29 '15 at 21:47
  • No, that seems highly implausible! Presumably discovering the optimal evaluation strategy for a given term can involve huge amounts of work before you even start performing the evaluation. – Tom Ellis Jul 29 '15 at 21:53
  • Hey, do you have suggestions of functions over natural numbers that are expensive to compute naively, but for which there is a magical formula that does it quickly? – MaiaVictor Jul 29 '15 at 21:59
  • 4
    @Viclib Fibonacci as @Tom says is a good example. `fib` requires exponential time in the naive way, which can be reduced to linear with a simple memoization/dynamic programming. Even logarithmic (!) time is possible through computing the n-th matrix power of `[[0,1],[1,1]]` (as long as you count each multiplication to have a constant cost). – chi Jul 29 '15 at 22:15
  • 1
    Even constant time if you're daring enough to approximate :) – J. Abrahamson Jul 29 '15 at 23:13
  • Look at these references: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.90.2386 and http://www.lsv.ens-cachan.fr/Publis/PAPERS/PDF/sinot-wrs07.pdf. This is sometimes also known as "complete laziness". Call-by-need reduces recalculation but doesn't allow sharing under abstractions; complete laziness requires substantially more bookkeeping but doesn't need to recompute anything. – Jason Dagit Jul 29 '15 at 23:37
  • 1
    @chi, fair enough - I've tested and fib stays exponential. I posted the tests [as a gist](https://gist.github.com/SrVictorMaia/b735f9e06f8461585974). No wizardry this time. – MaiaVictor Jul 30 '15 at 02:42
  • @Viclib, it's hardly surprising that `fib n` takes exponential time to compute when its value also grows exponentially. How about computing `fib n` modulo some small constant? – Reid Barton Jul 30 '15 at 04:55
  • 5
    @TomEllis Why would something that only knows how to reduce arbitrary lambda calculus expressions have any idea that `(a * b) % n = ((a % n) * b) % n` though? That is the mysterious part surely. – Reid Barton Jul 30 '15 at 05:01
  • @chi You **cannot** compute Fibonacci numbers in less than linear time, since they have a linear number of bits. The matrix trick requires a logarithmic number of *multiplications*, but the total number of bit operations is still *linear*. You can't simply ignore the size of the matrix coefficients, since they grow quite a lot. If you make that assumption, you could just as well assume a fixed integer size and compute all Fibonacci numbers in constant time with a lookup table. – Bakuriu Jul 30 '15 at 05:36
  • 2
    @ReidBarton surely I tried it! Same results, though. – MaiaVictor Jul 30 '15 at 06:22
  • 1
    @ReidBarton That's still a qusetion that needs answering, but I can imagine that an optimal evaluator could somehow end up performing a similar transformation, even if by accident. For example, the modulo "algorithm" might well be recursive on multiplicative structure. On the other hand I *can't* imagine an optimal evaluator using Fermat's Little Theorem or Euler's theorem! – Tom Ellis Jul 30 '15 at 06:26
  • 2
    @TomEllis and Chi, There is just a small remark, though. That all assumes that traditional recursive function is the "naive" fib implementation, but IMO there is an alternative way to express it that is much more natural. The normal form of that new representation has half of the size of the traditional one), and Optlam manages to compute that one linearly! So I'd argue that is the "naive" definition of fib as far as λ-calculus is concerned. I'd make a blog post but I'm not sure it is really worth it... – MaiaVictor Jul 30 '15 at 17:02