I am interested in how APL is so efficient at what it does, to the point of sometimes being benchmarked as outperforming C. So I'm curious, what are some of the optimizations done by the APL compiler to make the language so efficient?
-
"APL compiler" [tag:compiler-optimization] [tag:interpreted-language]‽ – Adám Jul 03 '19 at 22:30
-
@Adám I'm confused at what this implies. APL is both compiled and interpreted, usually interpreted. I'm interested in optimizations done within these processes to make the language so performant at what it does. – Dan Harmon Jul 03 '19 at 22:32
-
Ah OK, then it should probably say *done by the APL interpreter/compiler to…* – Adám Jul 03 '19 at 22:34
-
For all practical purposes, APL is an interpreted language. Historically, some specialised expressions may have been handled by a JITter, for example +/ and +\ in Sharp APL in the 1980s. APL compilers do exist, but arguably have not reached the "commercial product" level. See http://www.snakeisland.com/ – Lobachevsky Aug 04 '19 at 08:56
-
Also https://www.springer.com/gp/book/9780387966434, http://web.engr.oregonstate.edu/~budd/Books/aplc/ (a version is available for download), https://groups.google.com/forum/#!topic/comp.lang.apl/_gEOpAtbq3Y (look for APLc) – Lobachevsky Aug 04 '19 at 09:16
2 Answers
Here are some examples of how it is done:

- Stackless Traversal: blog
- Vector Instructions: video
- Hashed arrays: video, documentation
- Special code for certain phrases: video (a sketch of this idea follows the list)
- Pipelines and CRCs: video
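To make the "special code for certain phrases" item a bit more concrete, here is a minimal, purely illustrative C sketch: a toy evaluator recognises the token pattern `+/` (plus-reduce) over an integer vector and dispatches to a fused summation loop instead of the generic reduce machinery. All names are invented for illustration; this is not taken from any real APL interpreter.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { int64_t *data; size_t len; } IntVec;

/* Generic path: reduce with an arbitrary dyadic function pointer. */
static int64_t reduce(int64_t (*f)(int64_t, int64_t), IntVec v) {
    int64_t acc = v.data[v.len - 1];
    for (size_t i = v.len - 1; i-- > 0; )
        acc = f(v.data[i], acc);
    return acc;
}
static int64_t plus(int64_t a, int64_t b) { return a + b; }

/* Special-cased path: the evaluator noticed the phrase "+/" and runs
 * a tight loop the C compiler can vectorise, avoiding one indirect
 * call per element. */
static int64_t plus_reduce_special(IntVec v) {
    int64_t acc = 0;
    for (size_t i = 0; i < v.len; i++)
        acc += v.data[i];
    return acc;
}

int main(void) {
    int64_t xs[] = {1, 2, 3, 4, 5};
    IntVec v = { xs, 5 };
    /* Both paths compute the same result (15). */
    printf("%lld %lld\n",
           (long long)reduce(plus, v),
           (long long)plus_reduce_special(v));
    return 0;
}
```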
Relatedly, these discuss the principles behind the above:

- Choosing algorithms depending on data patterns seen at runtime, and how a lazy/thunked APL could skip some code entirely: video
- Lower memory latency by reading entire simple arrays sequentially, and avoiding branch-prediction failures through branchless code: video (APL aligns with these ideas, and encourages these styles, more easily than many other languages; a branchless sketch follows this list)
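As a hypothetical illustration of the branchless point (not interpreter source code): counting elements below a threshold with a data-dependent branch versus using the comparison result directly as a 0/1 value, which is essentially what an APL expression like `+/X<100` boils down to.

```c
#include <stdio.h>
#include <stddef.h>

static size_t count_lt_branchy(const int *x, size_t n, int t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (x[i] < t)        /* unpredictable branch on random data */
            c++;
    return c;
}

static size_t count_lt_branchless(const int *x, size_t n, int t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (size_t)(x[i] < t);   /* comparison result used as 0/1 */
    return c;
}

int main(void) {
    int x[] = {5, 42, 7, 99, 100, 3, 250, 17};
    size_t n = sizeof x / sizeof x[0];
    /* Both variants count 6 elements below 100. */
    printf("%zu %zu\n", count_lt_branchy(x, n, 100),
                        count_lt_branchless(x, n, 100));
    return 0;
}
```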

You cannot compare two languages (like C vs. APL) as such in terms of performance because the performance depends considerably on the implementations of the languages and the libraries used. The same C program can be slow on one platform (read: Windows) and fast on another. The key point is that performance is almost entirely a property of a given implementation of a language and not a property of the language itself.
In the case of APL one can split the CPU cycles needed for a given operation into two parts: the interpreter overhead (processing of the tokens that make up an APL program) and the primitives themselves (such as addition, reduce, etc.). In most APL interpreters the interpreter overhead is rather small, which implies that optimizations on that part cannot gain much performance (Amdahl's law). In early APLs (say, 1970) that was different. The processing of the APL primitives in current interpreters is implemented in C/C++, so that part of the CPU cycles is performance-wise the same as for C (again keeping in mind that the implementation can make a difference).
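As a rough illustration of that split (a hypothetical sketch, not code from any actual interpreter): the per-element work for a primitive such as element-wise addition ends up being an ordinary C loop like the one below, while the interpreter overhead is the dispatch on types and shapes that happens before the loop is entered, once per expression rather than once per element.

```c
#include <stdio.h>
#include <stddef.h>

/* The kind of kernel an APL expression like Z←X+Y ends up running for
 * two conforming double vectors. Compilers typically vectorise this. */
static void add_f64(double *z, const double *x, const double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

int main(void) {
    double x[] = {1, 2, 3}, y[] = {10, 20, 30}, z[3];
    add_f64(z, x, y, 3);
    printf("%g %g %g\n", z[0], z[1], z[2]);
    return 0;
}
```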
I have performed some benchmarks at the level of APL primitives: different scalar functions, from simple (integer addition) to not so simple (complex arc cosine), and outer products of them. The somewhat surprising result was that the performance of the different scalar functions was dominated not by the complexity of the computed function but by the access time to/from memory (including caches) and by the branch prediction of the CPU. As an example, if you run the same APL operation in a loop, the second iteration is typically twice as fast as the first, and the timings stabilize after about the fourth iteration (on an i5-4570 CPU).
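A crude C timing loop along these lines shows the same kind of warm-up pattern outside APL (first iterations slower due to first-touch page faults, cold caches and CPU frequency scaling). This is only a sketch for reproducing the effect, not the benchmark code used above, and the exact numbers are machine-dependent.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* ~4M doubles, larger than typical caches */

int main(void) {
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    double *z = malloc(N * sizeof *z);
    if (!x || !y || !z) return 1;
    for (size_t i = 0; i < N; i++) { x[i] = (double)i; y[i] = 1.0; }

    for (int it = 0; it < 6; it++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++)
            z[i] = x[i] + y[i];            /* the "primitive" */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("iteration %d: %.3f ms\n", it + 1, ms);
    }
    free(x); free(y); free(z);
    return 0;
}
```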
The measured values fluctuated a lot, which makes old-fashioned performance measurements (like "interpreter X is twice as fast as interpreter Y") rather meaningless.
As a rule of thumb, if the average vector size (i.e. ⍴,X) of your APL program is 20 or more, then you can entirely ignore the interpreter overhead and the APL program has roughly the same performance as a comparable C program.
The cases where APL is faster than C (which is impossible in theory) can frequently be traced back to the use of different algorithms in C and in APL. A typical real-life example is sorting with heapsort in one case and with quicksort in the other. This is again a difference between implementations and not a difference between the languages themselves.

-
Some APL interpreters are actually partly written in assembler to take advantage of fancy algorithms that the C compiler cannot figure out how to use. – Adám Jul 04 '19 at 17:55
-
*same performance as a **comparable** C program* – but that's exactly the point. For the C programmer to get the same performance, they'd basically have to go through the evolutionary fine-tuning that the APL interpreter has had decades for. That is the strength of interpreted languages: improvements in the interpreter benefit all existing code. This is much like improvements in the C compiler benefiting all re-compilations, but interpretation adds an additional beneficial layer. Without this, it is simply unrealistic to get the same performance in C as in APL. – Adám Jul 04 '19 at 17:58
-
*sorting with heapsort in one case and with quicksort in the other* – no, the C programmer may choose one algorithm or another, but the APL interpreter can afford to have all sorting algorithms implemented and then, on the fly, choose the one it estimates will be fastest for that particular sorting instance, based on how the data looks. Most C programmers will not go to such lengths every time they just need something sorted, as that would take too much effort. In the interpreted language, each method just has to be implemented once, together with some heuristics to decide which one to use. – Adám Jul 04 '19 at 18:03
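A toy C sketch of that dispatch idea, with invented heuristics (real interpreters are considerably more refined): implement each strategy once, take a cheap look at the data, then choose.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Counting sort: wins when the value range is small relative to n. */
static void counting_sort(int *x, size_t n, int lo, int hi) {
    size_t range = (size_t)((long long)hi - lo) + 1;
    size_t *cnt = calloc(range, sizeof *cnt);
    if (!cnt) { qsort(x, n, sizeof *x, cmp_int); return; }
    for (size_t i = 0; i < n; i++) cnt[x[i] - lo]++;
    size_t k = 0;
    for (size_t v = 0; v < range; v++)
        for (size_t c = 0; c < cnt[v]; c++)
            x[k++] = lo + (int)v;
    free(cnt);
}

/* The "interpreter": inspect the data once, then dispatch. */
static void smart_sort(int *x, size_t n) {
    if (n < 2) return;
    int lo = x[0], hi = x[0], sorted = 1;
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
        if (x[i] < x[i - 1]) sorted = 0;
    }
    if (sorted) return;                            /* already in order   */
    long long span = (long long)hi - (long long)lo;
    if (span < (long long)(4 * n))                 /* narrow value range */
        counting_sort(x, n, lo, hi);
    else
        qsort(x, n, sizeof *x, cmp_int);           /* general fallback   */
}

int main(void) {
    int a[] = {3, 1, 2, 3, 1, 0, 2, 2};
    smart_sort(a, sizeof a / sizeof a[0]);
    for (size_t i = 0; i < 8; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```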