Context

I am currently optimizing a library for scientific computing. I'm fairly new to Common Lisp. The functions I use are quite small and execute in roughly 10 ns to a few hundred nanoseconds on an average laptop. The performance is already quite close to C, but I want every bit of speed I can get.

I use SBCL, its compiler notes, and the (time) macro to optimize (any general advice is welcome). I'm currently optimizing a single-threaded chain of computations that will later be run in independent threads.

Problem

For the sake of argument, let's say I have a function (foo) that adds two lists of 3 variables term by term. Once optimized, it could look something like:

(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (list (+ a1 b1) (+ a2 b2) (+ a3 b3)))

And I time it with:

(time (dotimes (it 1000000 t) (foo 1.0d0 2.0d0 2.351d0 223.0d0 124.2d0 321d0)))

SBCL notes

From what I gather, the compiler complains that "casting" the result in a list is expensive.

note: doing float to pointer coercion (cost 13)

What I'd like

The complaints of SBCL seem sensible, so I'm looking for a way to do away with those pesky lists, which I have to peel apart again anyway at some point later in order to feed them to other calculations. I'm willing to go low-level for this and lose (some) abstraction. A use case could look like:

(let ((res1 0d0)
      (res2 0d0)
      (res3 0d0))
 (declare (type double-float res1 res2 res3))
 (magic-abstraction-I-want res1 res2 res3 (foo 1d0 1d0 1d0 1d0 1d0 1d0)))

With this, I could string computations with minimal or nonexistent overhead, doing purely the computations that are needed and spend no time creating lists or accessing them.

What I tried/Think of trying

Inline

I'm seeing huge performance improvements on simple functions such as foo when I add:

(declaim (inline foo))

From what I understand, it sort of "expands" the function and writes its body inline at the call site. Is that right? What does it do exactly? Does it in fact do what I want? Does it solve the problem of "casting to lists" in some way?
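One way to check what inlining actually buys at a call site (a sketch; `disassemble` is standard Common Lisp, but the output is SBCL- and platform-specific, and `bar` is a made-up caller for illustration) is to compare the compiled code of a caller with and without the declamation:

```lisp
(declaim (inline foo))
(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (list (+ a1 b1) (+ a2 b2) (+ a3 b3)))

;; A caller that SBCL can open-code once FOO is inlined.
(defun bar (x y)
  (declare (optimize (speed 3)) (type double-float x y))
  (first (foo x x x y y y)))

;; Inspect the generated machine code; with the inline declamation the
;; additions should happen on unboxed doubles in floating-point registers
;; rather than going through a full call with boxed arguments.
(disassemble #'bar)
```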

(Also, feel free to give general speed optimization advice if you see from what I write that there is something that I might have misunderstood)

Edit: After learning about values

I modified foo; it is now:

(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (values (+ a1 b1) (+ a2 b2) (+ a3 b3)))

SBCL still outputs three notes about coercing return values to pointers, and I'm still consing bytes, as measured with:

(time (dotimes (it 1000000 t) (foo 1.0d0 2.0d0 2.351d0 223.0d0 124.2d0 321d0)))

However, the calls with inline are much faster and do not cons anything (as expected I guess):

(declaim (inline foo))
;;;;
(let ((r1 0d0) (r2 0d0) (r3 0d0)
      (a1 1d0) (a2 2d0) (a3 3d0)
      (b1 4d0) (b2 5d0) (b3 6d0))
  (declare (optimize (speed 3))
           (type double-float r1 r2 r3 a1 a2 a3 b1 b2 b3))
  (time (dotimes (it 1000000 t) (setf (values r1 r2 r3) (foo a1 a2 a3 b1 b2 b3)))))

The performance is exactly the same as:

(let ((r1 0d0) (r2 0d0) (r3 0d0)
      (a1 1d0) (a2 2d0) (a3 3d0)
      (b1 4d0) (b2 5d0) (b3 6d0))
  (declare (optimize (speed 3))
           (type double-float r1 r2 r3 a1 a2 a3 b1 b2 b3))
  (time (dotimes (it 1000000 t)
          (setf r1 (+ a1 b1))
          (setf r2 (+ a2 b2))
          (setf r3 (+ a3 b3)))))

Which is exactly what I wanted. The last very minor thing is that SBCL still complains about the optimization of foo, but I can just muffle it.

  • @sds: looks like he asked a bit more than just about multiple-values - the question is about 'scientific computing' with floats. – Rainer Joswig May 28 '18 at 21:16
  • @RainerJoswig: the immediate question was about consing-free return of multiple values, which is answered by the `values` question. The secondary, more advanced questions about scientific computing should be asked more specifically either in a separate question or after heavy editing of this one. – sds May 28 '18 at 21:39
  • While the first-order answer to this question is “observe that `values` exists,” I don’t think this question is a duplicate of the question about `values`. Specifically one will still get float-to-pointer coercion problems with multiple values and the answers to that question do not relate to this at all. There are also measurement problems related to the results not being referenced in `foo2`, where, assuming good compilation, one is only measuring the overhead of a function call. There’s also something to be said about `ftype` declarations. – Dan Robertson May 28 '18 at 22:39
  • @DanRobertson: if the OP learns enough about `values` to edit the question appropriately - so that it stops being a dupe - I will gladly reopen it. – sds May 29 '18 at 01:03
  • Alright, I edited my question with your suggestions about `values`. This proves to be what I wanted. Thank you everyone. – Frank Jefferson May 29 '18 at 07:33
  • I do not think that closing this question was correct in the first place. It would have been a good opportunity to explain optimization and type declaration, and especially `the` has not yet been mentioned. And no, this question is not about `values`, and it actually wasn't in its first version either. – Svante May 29 '18 at 07:40
  • @FrankJefferson presumably sbcl is still complaining about casting floats to pointers even if you return `values`? This is then about the memory representation in sbcl. It would be nice if someone could go into the details of why this happens and what one can do. “Just inline everything” doesn’t feel like a great solution, especially as it can slow things down. I wonder if sbcl can be convinced to allocate a float array on the stack, e.g. make an array of 3 double floats; declare it `dynamic-extent`; pass it to `foo` and have `foo` put the results in the array. This is normal for >>3 items. – Dan Robertson May 29 '18 at 08:32
  • "presumably sbcl is still complaining about casting floats to pointers even if you return values?" It does indeed. The "just inline everything" seems to work for me at the moment, though I can conceive that it can be frustrating to experienced lispers. – Frank Jefferson May 29 '18 at 08:41
  • See also https://stackoverflow.com/a/47451162/5747548 about the optimization notes. – jkiiski May 29 '18 at 09:47

1 Answer


Ok so the first thing I want to do is explain what “float to pointer coercion” means.

(Prerequisites: an understanding of what memory is in a machine, bits and a rough idea of the C memory model).

In Lisp, which is dynamically typed, it is values that have types, not variables. Therefore you need some consistent memory representation that can be used to pass any value of any type, and from which the type can be determined. (Note that even in some strongly typed functional languages one may still need a structured memory representation so that the garbage collector can follow pointers, or a uniform way to represent all pointers in a word so that polymorphism works.) The typical choice is a pointer, which is always a word (let’s say 8 bytes); the object pointed to may then have a header with more information about its type. In C it would look something like:

#include <stdint.h>

struct LispObj_s {
  uint32_t length;
  uint16_t gc_data;
  uint16_t type_code;
  union {
    int64_t as_smallint;
    float   as_float;
    double  as_double;
    char    as_string[1];  /* first byte of inline string data */
    /* ... bignums, arrays, cons cells, structs, etc. */
  };
};

typedef struct LispObj_s *LispObj;
This is bad because lots of commonly used things (namely integers) then have a huge overhead: one word for the pointer to the object, one for the header (saying “I am an integer and I am 1 word long”), and one for the number itself. That’s an overhead of 200%, and it means that integers require a pointer dereference and a possible cache miss. So there’s an optimisation: you use a few of the least significant bits of the pointer to say what the type is (these bits are free because everything is word-aligned, so the three least significant bits of a pointer are always 0). Then you can say that if (for example) the last bit is 0 you have a fixnum (a small integer where arithmetic is fast, in this case 63 bits), if the last bits are 101 you have a pointer to a cons cell, if they are 111 a pointer to a double, if 011 an immediate single-float, and if 001 a pointer to something else. Now integers, conses, and floats are cheaper, which is good, but the disadvantage is that you need to be a bit more careful to make sure the tags are always right. The problem is that we can’t do the same trick for doubles that we did for integers: it isn’t really feasible to just chop a few bits off a double the way you chop a bit off an integer, especially if you want compatibility with external programs.
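You can actually peek at this tagging in SBCL itself. This is a sketch using the internal, undocumented `sb-kernel:get-lisp-obj-address`; its availability and the exact tag layout vary between SBCL versions and platforms, so the printed bit patterns are illustrative, not guaranteed:

```lisp
;; The low bits of the machine word holding a Lisp value encode its type tag.
;; On 64-bit SBCL a fixnum is stored shifted left, so its low tag bit(s) are
;; zero, while a heap object such as a cons carries a nonzero lowtag.
(format t "fixnum 42: #b~b~%" (sb-kernel:get-lisp-obj-address 42))
(format t "cons:      #b~b~%" (sb-kernel:get-lisp-obj-address (list 1)))
(format t "double:    #b~b~%" (sb-kernel:get-lisp-obj-address 1.0d0))
```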

In compiled code you want to be operating on the objects (like floats) themselves and so keep them in their raw form in registers or the stack. Only the compiled code can access them and it knows what type they are.

When you want to return or pass (e.g.) a float as an argument, the code receiving it doesn’t necessarily know what sort of object it will get, and the code sending it certainly doesn’t know that the recipient wants a certain type of data. Thus you need to convert it to some uniform form that also says what type it is, and to do this you need to allocate some memory on the heap for it and pass a tagged pointer to that memory. When calling a function you don’t know what the called function is going to do with the float, and therefore you can’t necessarily (see later) allocate it on the stack, because the callee might e.g. put it in a global variable, and then a bit later that variable would point to garbage data or unmapped memory. It’s even worse when returning, as you’d be allocating the float in your own stack frame, which you’re about to destroy. Allocation tends to be slow (although much faster than in e.g. C; slow mainly because you’re writing to memory that is not in cache), and reading allocated objects tends to be slow (because you need to inspect and remove tags, dereference a pointer, and often take a cache miss), which is why sbcl complains when its optimiser has to allocate and box an object.

One thing one can do to make allocation faster is to declare objects dynamic-extent. This tells Lisp that you promise the object will not be referenced from anywhere once the current dynamic scope ends, and that it may therefore be allocated on the stack. Stack allocation is faster and more cache-local, and freeing stack-allocated objects is basically free as the GC is not involved. You can do this when passing arguments but not when returning.
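A minimal sketch of that (the names `sum3` and `caller` are made up for illustration; `dynamic-extent` is a promise by the programmer which SBCL may use to stack-allocate):

```lisp
(defun sum3 (xs)
  (declare (optimize (speed 3)))
  (+ (first xs) (second xs) (third xs)))

(defun caller (a b c)
  (declare (type double-float a b c) (optimize (speed 3)))
  (let ((args (list a b c)))
    ;; Promise that ARGS does not escape CALLER, so SBCL may allocate
    ;; the three cons cells on the stack instead of the heap.
    (declare (dynamic-extent args))
    (sum3 args)))
```

Note that the promise is unchecked at safety 0: if `sum3` did stash `xs` somewhere, that reference would later point at a dead stack frame.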

Using values is a bit like stack-allocating a returned list (the &rest parameter of a function can also be stack-allocated if declared dynamic-extent in sbcl), except that, unlike actually stack-allocating a return value, it is possible, so this is a more efficient way to return multiple values. Unfortunately, due to the uniform memory representation, one can’t avoid the boxing of floats this way.
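For reference, a sketch of returning and consuming multiple values without an intermediate list, using the `foo` from the question's edit:

```lisp
(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (values (+ a1 b1) (+ a2 b2) (+ a3 b3)))

;; Bind the three returned values directly; no list is built or torn apart.
(multiple-value-bind (r1 r2 r3) (foo 1d0 2d0 3d0 4d0 5d0 6d0)
  (+ r1 r2 r3))
;; => 21.0d0
```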


Answer

However if we know about the function we are calling, we can stack-allocate the space for its return values in advance and then have the function put its values there. If we do this we can also avoid the float to pointer coercion in the following way: sbcl has an optimised representation for float arrays where instead of an array of pointers to floats, the floats are stored directly (e.g. as float* rather than float**):

(defun foo3 (a1 a2 a3 b1 b2 b3 result)
  (declare (optimize (speed 3) (safety 0))
           (type double-float a1 a2 a3 b1 b2 b3)
           (type (simple-array double-float (3)) result))
  (setf (aref result 0) (+ a1 b1)
        (aref result 1) (+ a2 b2)
        (aref result 2) (+ a3 b3))
  nil)

And then we can call this in the following convoluted way:

(let ((foo-result (make-array '(3) :element-type 'double-float :initial-element 0d0)))
  (declare (dynamic-extent foo-result))
  (foo3 a1 a2 a3 b1 b2 b3 foo-result)
  (let ((c1 (aref foo-result 0)) ...)
    ...))

This way we allocate space for the results on the stack, then foo3 fills it in, then we extract the results, presumably into registers. This should be almost as fast as foo3 returning its results in registers, and crucially doesn’t need heap allocation ((Edit: I don’t think this is true) if sbcl assumes the function won’t change, it can skip past the type checking/unboxing straight into the meat of the function).

Unfortunately the syntax is unpleasant, but there is a way around it in Lisp: macros. One could implement a special macro for defining a function like foo3 and a special macro for calling it and binding the results, but this is bad because you have now split the world into two types of function: you can’t use higher-order functions with foo, and you might end up writing macro-generating macros, which are tricky to debug. Instead the idea is this: we generate a fast version with our weird calling convention, plus a wrapper function which calls it and is declared inline. Then whenever we call the wrapper function we get the advantage of a fast call, and all the cost of the wrapper is removed by the compiler.

(defmacro fast-defun (name num-values lambda-list &body body)
  (assert (and (integerp num-values) (plusp num-values)))
  (let ((fast-name (gensym (symbol-name name)))
        (result (gensym "RESULT"))
        (vars (loop repeat num-values collect (gensym "V"))))
    `(progn
       (defun ,fast-name ,(cons result lambda-list) ;; FIXME: this only works if lambda-list is a list of symbols
         (declare (optimize (speed 3) (safety 0))
                  (type (simple-array double-float (,num-values)) ,result))
         ;; TODO: move declarations here from body
         (multiple-value-bind ,vars (progn ,@body)
           ,@(loop for v in vars
                   for n from 0
                   collect `(setf (aref ,result ,n) ,v))
           nil))
       (declaim (inline ,name))
       (defun ,name ,lambda-list
         (let ((,result (make-array '(,num-values) :element-type 'double-float :initial-element 0d0)))
           (declare (dynamic-extent ,result) (optimize (speed 3)))
           (,fast-name ,result ,@lambda-list)
           (values ,@(loop for n below num-values collect `(aref ,result ,n))))))))

This macro isn’t in its most general form, but you can use it the way you would use (defun foo ...) above, just writing (fast-defun foo 3 ...) instead.
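Under the assumptions baked into the macro (a lambda list of plain symbols, double-float results, and no declarations in the body, since the body is spliced into a progn), a call might look like:

```lisp
;; Define the fast version plus its inline wrapper in one go.
(fast-defun foo 3 (a1 a2 a3 b1 b2 b3)
  (values (+ a1 b1) (+ a2 b2) (+ a3 b3)))

;; The inline wrapper restores the normal calling convention,
;; so callers just see a function returning three values.
(multiple-value-bind (r1 r2 r3) (foo 1d0 2d0 3d0 4d0 5d0 6d0)
  (list r1 r2 r3))
```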


After some more experimenting, it seems that this is not actually any faster. Somehow consing is still happening and I can’t work out how to avoid it without inlining everything. I also tried making the input a dynamic extent array but that didn’t seem to help.

After looking around a bit more, I failed to find a way to look at intermediate results in the compilation pipeline which made me sad but I scanned around in the source code and I think that, although the compiler can (sometimes) specialise the calling of functions, it can’t specialise the returning of float values.

Dan Robertson
  • Outstanding. Great read. The macro will take me some time to understand, I'll play with it tomorrow and `time` results. – Frank Jefferson May 29 '18 at 20:30
  • I’ll save you the effort: the results are worse than all the things in your question except returning a list – Dan Robertson May 29 '18 at 20:31
  • Well that's too bad. No clue why that would be at the moment :/ I just notice that I've been having trouble timing functions the correct way, as SBCL keeps doing smart things when inputs are known at compile time or when the outputs do not actually need to be shown. If I get the time, I'll do a comparison with runtime random inputs and results pushed to a list, just to be sure. – Frank Jefferson May 29 '18 at 20:47
  • Also why are you reluctant to using `inline` exactly ? The way I see it, it's just a compiler declaration I make that speeds things up quite a bit. Is there a downside ? The hyperspec gives no warning or anything. – Frank Jefferson May 29 '18 at 20:55
  • It’s the same as excessively inlining in any language: you tend to end up with much more generated code so compilation is slower, less gets done for what you can fit in your cache (mainly if a function is used multiple times), it doesn’t mix with recursion, debugging can give worse information. The main issue is code size. You pay the code-size price for a function every time it is inlined and again every time one of its callers is inlined and so on. – Dan Robertson May 29 '18 at 21:00
  • I see. Shouldn't be too much of a problem in my case, but I'll keep that in mind. – Frank Jefferson May 29 '18 at 21:04
  • [some online notes](https://github.com/guicho271828/sbcl-wiki/wiki/Object-representation-%28lowtags%2C-widetags%2C-memory-consumption%2C-arrays%29) mention that in some cases functions returning floats don’t need to box their result but I couldn’t make it happen – Dan Robertson May 29 '18 at 21:15
  • Note that you can also do `(declaim (inline f)) (defun f ...) (declaim (notinline f))` so the function won't be inlined by default, but can still be locally declared to be inlined with `(declare (inline f))`. This way you don't have to duplicate the code at every call site, but can optimize it when really needed. – jkiiski May 30 '18 at 11:24