performance of static member constraint functions

Question

I'm trying to learn static member constraints in F#. From reading Tomas Petricek's blog post, I understand that writing an inline function that "uses only operations that are themselves written using static member constraints" will make my function work correctly for all numeric types that satisfy those constraints. This question indicates that inline works somewhat similarly to c++ templates, so I wasn't expecting any performance difference between these two functions:

let MultiplyTyped (A : double[,]) (B : double[,]) =
    let rA, cA = (Array2D.length1 A) - 1, (Array2D.length2 A) - 1
    let cB = (Array2D.length2 B) - 1
    let C = Array2D.zeroCreate<double> (Array2D.length1 A) (Array2D.length2 B)
    for i = 0 to rA do
        for k = 0 to cA do
            for j = 0 to cB do
                C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
    C

let inline MultiplyGeneric (A : 'T[,]) (B : 'T[,]) =
    let rA, cA = Array2D.length1 A - 1, Array2D.length2 A - 1
    let cB = Array2D.length2 B - 1
    let C = Array2D.zeroCreate<'T> (Array2D.length1 A) (Array2D.length2 B)
    for i = 0 to rA do
        for k = 0 to cA do
            for j = 0 to cB do
                C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]
    C

Nevertheless, to multiply two 1024 x 1024 matrixes, MultiplyTyped completes in an average of 2550 ms on my machine, whereas MultiplyGeneric takes about 5150 ms. I originally thought that zeroCreate was at fault in the generic version, but changing that line to the one below didn't make a difference.

let C = Array2D.init<'T> (Array2D.length1 A) (Array2D.length2 B) (fun i j -> LanguagePrimitives.GenericZero)

Is there something I'm missing here to make MultiplyGeneric perform the same as MultiplyTyped? Or is this expected?

edit: I should mention that this is VS2010, F# 2.0, Win7 64bit, release build. Platform target is x64 (to test larger matrices) - this makes a difference: x86 produces similar results for the two functions.

Bonus question: the type inferred for MultiplyGeneric is the following:

val inline MultiplyGeneric :
   ^T [,] ->  ^T [,] ->  ^T [,]
    when ( ^T or  ^a) : (static member ( + ) :  ^T *  ^a ->  ^T) and
          ^T : (static member ( * ) :  ^T *  ^T ->  ^a)

Where does the ^a type come from?

edit 2: here's my testing code:

let r = new System.Random()
let A = Array2D.init 1024 1024 (fun i j -> r.NextDouble())
let B = Array2D.init 1024 1024 (fun i j -> r.NextDouble())

let test f =
    let sw = System.Diagnostics.Stopwatch.StartNew()
    f() |> ignore
    sw.Stop()
    printfn "%A" sw.ElapsedMilliseconds

for i = 1 to 5 do
    test (fun () -> MultiplyTyped A B)

for i = 1 to 5 do
    test (fun () -> MultiplyGeneric A B)

I observed the same issue a while back too: http://stackoverflow.com/q/2945880/236255 — Stephen Swensen, Apr 08 '15 at 01:54

Daniel · Answer 1 · 2013-02-21T17:42:42.603

3

I'd like to see your benchmarks. I don't get the same results (VS 2012 F# 3.0 Win 7 64-bit).

let m = Array2D.init 1024 1024 (fun i j -> float i * float j)

let test f =
  let sw = System.Diagnostics.Stopwatch.StartNew()
  f() |> ignore
  sw.Stop()
  printfn "%A" sw.Elapsed

test (fun () -> MultiplyTyped m m)
> 00:00:09.6013188

test (fun () -> MultiplyGeneric m m)
> 00:00:09.1686885

Decompiling with Reflector, the functions look identical.

Regarding your last question, the least restrictive constraint is inferred. In this line

C.[i,j] <- C.[i,j] + A.[i,k] * B.[k,j]

because the result type of A.[i,k] * B.[k,j] is unspecified, and is passed immediately to (+), an extra type could be involved. If you want to tighten the constraint you can replace that line with

let temp : 'T = A.[i,k] * B.[k,j]
C.[i,j] <- C.[i,j] + temp

That will change the signature to

val inline MultiplyGeneric :
  A: ^T [,] -> B: ^T [,] ->  ^T [,]
    when  ^T : (static member ( * ) :  ^T *  ^T ->  ^T) and
          ^T : (static member ( + ) :  ^T *  ^T ->  ^T)

EDIT

Using your test, here's the output:

//MultiplyTyped
00:00:09.9904615
00:00:09.5489653
00:00:10.0562346
00:00:09.7023183
00:00:09.5123992
//MultiplyGeneric
00:00:09.1320273
00:00:08.8195283
00:00:08.8523408
00:00:09.2496603
00:00:09.2950196

Here's the same test on ideone (with a few minor changes to stay within the time limit: 512x512 matrix and one test iteration). It runs F# 2.0 and produced similar results.

edited Feb 21 '13 at 17:42

answered Feb 21 '13 at 16:39

Daniel

47,404
11
101
179

I now get the `^a`, +1. I'll edit the question with test numbers. – vlad Feb 21 '13 at 16:43
Sending `let _ = MultiplyXXX` to FSI produces numbers similar to yours. I wonder it's due to compilation overhead associated with inlining. In my tests `MultiplyGeneric` is marginally faster in a release build .exe. – Daniel Feb 21 '13 at 16:47
I added the testing code (used your test function). I'm simply running without debugging in release mode. I also tried just running the .exe, same results: Generic is ~twice as slow. The only difference I can see is I'm using VS2010 / F# 2. – vlad Feb 21 '13 at 16:56
Running inside VS, `MultiplyGeneric` is slightly slower, but running a release build from the console it's about a second faster using your test. – Daniel Feb 21 '13 at 17:04
I did: built in release mode, opened command prompt, ran the exe: same results :( `Typed` takes ~2550ms, `Generic` takes ~5250 ms – vlad Feb 21 '13 at 17:07
I don't have access to VS2012 right now, but I'll definitely try that asap. – vlad Feb 21 '13 at 17:08
1

I added the output to my answer. I'm stumped. I don't have VS 2010 installed to compare. – Daniel Feb 21 '13 at 17:09
1

I tried it on ideone, which uses F# 2.0. Got the same results. Link is in my answer. – Daniel Feb 21 '13 at 17:17
ohhhh... it seems like the difference is between x86 and x64. I originally had the platform target as x64, but changing it to x86 (and adjusting the matrix sizes) gives me similar results for both functions. – vlad Feb 21 '13 at 17:21
2

You're absolutely right. I'm not sure why, but I was targeting x86. Changing to AnyCPU on my 64-bit machine causes `MultiplyGeneric` to be twice as slow. – Daniel Feb 21 '13 at 17:24
1

I should correct my previous comment: the timing for `MultiplyGeneric` remains about the same, but `MultiplyTyped` is twice as fast. – Daniel Feb 21 '13 at 17:27

score 3 · Answer 2 · answered Feb 21 '13 at 17:10

3

Good question. I'll answer the easy part first: the ^a is just part of the natural generalization process. Imagine you had a type like this:

type T = | T with
    static member (+)(T, i:int) = T
    static member (*)(T, T) = 0

Then you can still use your MultiplyGeneric function with arrays of this type: multiplying elements of A and B will give you ints, but that's okay because you can still add them to elements of C and get back values of type T to store back into C.

As to your performance question, I'm afraid I don't have a great explanation. Your basic understanding is right - using MultiplyGeneric with double[,] arguments should be equivalent to using MultiplyTyped. If you use ildasm to look at the IL the compiler generates for the following F# code:

let arr = Array2D.zeroCreate 1024 1024
let f1 = MultiplyTyped arr
let f2 = MultiplyGeneric arr

let timer = System.Diagnostics.Stopwatch()
timer.Start()

f1 arr |> ignore

printfn "%A" timer.Elapsed
timer.Restart()

f2 arr |> ignore

printfn "%A" timer.Elapsed

then you can see that the compiler really does generate identical code for each of them, putting the inlined code for MultipyGeneric into an internal static function. The only difference that I see in the generated code is in the names of locals, and when running from the command line I get roughly equal elapsed times. However, running from FSI I see a difference similar to what you've reported.

It's not clear to me why this would be. As I see it there are two possibilities:

FSI's code generation may be doing something slightly different than the static compiler
The CLR's JIT compiler may be treat code generated at runtime slightly differently from compiled code. For instance, as I mentioned my code above using MultiplyGeneric actually results in an internal method that contains the inlined body. Perhaps the CLR's JIT handles the difference between public and internal methods differently when they are generated at runtime than when they are in statically compiled code.

answered Feb 21 '13 at 17:10

kvb

54,864
2
91
133

+1, also a great explanation of the 'extra' `^a` type. As I mentioned to @Daniel, I tried running outsite of VS, and I still get the same results. Are you running this with VS2012 / F# 3? – vlad Feb 21 '13 at 17:13
1

The JIT compiler definitely treats runtime generated code differently from compiled code; at least the "normal" dynamic code as opposed to a complete assembly. However, both are non-dynamic, right? I mean, the inlining happens compile-time, not runtime, right? – Eamon Nerbonne Feb 21 '13 at 17:17
1

@kvb: As vlad discovered, the issue is exclusive to x64 build target. What gives? :-) – Daniel Feb 21 '13 at 17:25
@Daniel -- The x64 version of the CLR performs additional optimizations that the x86 version doesn't. So given the same assembly on an x86 and and x64 machine, the performance is usually fairly similar; for certain cases though, the performance will be significantly better on an x64 machine. – Jack P. Feb 21 '13 at 18:03
@EamonNerbonne - when using F# interactive the code is compiled via Reflection.Emit. The inlining happens at compile-time either way, though. – kvb Feb 21 '13 at 18:10
@vlad - Yes, I'm running with VS2012/F#3. With your test code, there are actually a few subtle differences in the generated IL that could possibly affect how they are JITted (e.g. the generic code loads A and B from fields, while the non-generic code loads them from arguments). – kvb Feb 21 '13 at 18:14
@vlad - On my system I get better, identical performance using x64 (~4 seconds per call), but worse, different performance using x86 (~7.5 seconds per non-generic call, ~8.5 seconds per generic call). Those numbers are when using your test code. When using my test code I get identical performance even on x86 (except when using F# Interactive). – kvb Feb 21 '13 at 18:17
@vlad In that case -- it could also be (as kvb said) differences in what the JIT compiler generates for static and runtime-generated code. IIRC, the JIT compiler disables certain optimizations for runtime-generated code (for security reasons). – Jack P. Feb 21 '13 at 18:18
@kvb I don't see a real difference between your test code and my test code. are you using different code for multiplying the matrices? – vlad Feb 21 '13 at 18:29
1

@vlad The difference is fairly subtle (and the corresponding IL differences are subtle too) - I'm partially applying each function first, then applying the result later. The bigger point is that inlining works as you expect, but that minor differences in IL can cause the JITted machine code to have noticeably different performance. – kvb Feb 21 '13 at 18:50
@kvb I copied your code and still saw a discrepancy: x64 Typed is faster than all the other combinations (x64 vs x86 and Typed vs Generic) – vlad Feb 21 '13 at 19:35
"FSI's code generation may be doing something slightly different than the static compiler": It certainly does. For instance, in FSI, it is legal to call private functions from inline public (or internal) functions; whereas in a source code file, the compiler will raise a "not sufficiently accessible" error. – Marc Sigrist Feb 28 '13 at 17:02

performance of static member constraint functions

2 Answers2