F# code optimization or is it really that slow?

Question

I was looking for a way to do proper algorithmic coding using .NET with all the benefits of modern languages (e.g. I like a strong type checking, operator overloading, lambdas, generic algorithms). Usually I write my algorithms (mainly image prozessing) in C++. As F# as a language seems to be quite interesting I played a bit, but it seems to be very slow. As a most simple test I just did some array manipulation -> brightness increase of an image:

let r1 = rgbPixels |> Array.map (fun x -> x + byte(10) )

It seems to be a factor of at least 8 times slower than a compared C++ implementation - even worse for more complex algorithms e.g. 2D convolution. Is there some faster way or do I miss any specific compiler settings (yes, building release with optimization on...)? I'm willing to pay for the nice and high abstracion, but such an overhead is not nice (I would need to parallelize on 8 cores to compensate :) ) - at least it destroys the motivation to learn further... My other option would be to leave my heavier algos in C++ and interface with manged C++, but this is not nice, as mainting the managed wrapper would be quite a burden.

It's possible it would be inlined, but you could start by replacing the call to `byte` with `10uy`. — Daniel, Feb 07 '12 at 20:55
Can you post your C++ implementation on [ideone](http://ideone.com/) so we have something to work with? — Daniel, Feb 07 '12 at 20:58
Are [these results](http://ideone.com/Z4Gvj) comparable to yours? — Daniel, Feb 07 '12 at 21:00
The C++ code I was reffering too was from a codeproject article I wrote some time back : http://www.codeproject.com/Articles/216143/Zero-abstraction-overhead-with-modern-compilers To make it short: in C++ I wrote some kind of iterator going over the raw image data array - pretty straight forward : just aloop over all elements like in F#... — TheBigW, Feb 07 '12 at 21:16
@Daniel, yes they are. If I replace the number of 10000 with the same amount of pixels I have (2000x3000 24Bit jpg), I get exacty the same result (342 ms) - same as for your example I get close to 0 ms — TheBigW, Feb 08 '12 at 19:12

Guvante · Answer 1 · 2012-02-07T22:08:14.697

6

If you are worried about performance one of the important things to remember is that F# by default does not mutate anything. This requires copying in many naive implementations of algorithms, such as the one you described.

EDIT: I have no idea why, but simple tests of the following code provides inferior results to Array.map. Be sure to profile any algorithm you try when performing these kinds of optimizations. However I got very similar results between for and map.

Array.map creates a new array for the result of the operation, instead you want Array.iteri.

rgbPixels |> Array.iteri (fun i x -> rgbPixels.[i] <- x + 10uy)

Note that this could be wrapped up in your own module as below

module ArrayM =
    let map f a = a |> Array.iteri (fun i x -> a.[i] <- f x)

Unfortunately this is a necessary evil since one of the primary tenants of functional programming is to stick to immutable objects as much as your algorithm will allow, and then once finished, change to mutation where performance is critical. If you know your performance is critical from the get-go, you will need to start with these kinds of helpers.

Also note that there is probably a library out there that provides this functionality, I just don't know of it off hand.

edited Feb 07 '12 at 22:08

answered Feb 07 '12 at 21:02

Guvante

18,775
1
33
64

1

Or, even faster: `for i = 0 to rgbPixels.Length-1 do rgbPixels.[i] <- rgbPixels.[i] + 10uy`. – Daniel Feb 07 '12 at 21:06
@Daniel: I was trying to keep as close to the original code as possible. I would assume that the function is inlined by the JIT anyway. – Guvante Feb 07 '12 at 21:15
@Guvante: I tested with FSI: the loop takes less than half the time...probably due to no bounds-checking, as kvb mentioned. – Daniel Feb 07 '12 at 21:19
to give some results (compared to C++ code from the code project article I mentioned above) C++ -> 40 ms, F# - my initial solution -> 269 ms, F# iteri -> 205 ms, F# for loop (surprisingly!!) -> 343 ms. Which leaves with very positive view and pessimistic C++ timing still a factor of 5... – TheBigW Feb 07 '12 at 21:25
1

@TheBigW: The CLR removes bounds-checking only in very specific circumstances. You may want to check [this question](http://stackoverflow.com/questions/631123/what-is-the-special-case-with-the-foreach-loop-that-eliminates-bounds-checking) and make sure your code conforms. The `for` loop should definitely be the fastest. – Daniel Feb 07 '12 at 21:36
1

@TheBigW: Might I ask how you are timing this? You have to be careful to avoid accidentally counting the JITing and CLR load times, which severally impact small tests like this. For instance if the combined cost of those for your sample were 150 ms, that would lead to a very minor overhead of 20% (for the simple case, potentially less or more for more complex cases) – Guvante Feb 07 '12 at 21:40
let t = System.Diagnostics.Stopwatch.StartNew() printfn "elapsed time: %d ms" t.ElapsedMilliseconds --> whatever method used goes inbetween. If I remove it I still get 1 ms, but otherwise this seems to work – TheBigW Feb 07 '12 at 21:46
@TheBigW: Are you calling `stopwatch.Stop()` before printing the time? `printfn` uses reflection and is quite slow. – Daniel Feb 07 '12 at 21:51
My results were: for 120, map 157, iteri 359, smart iteri 577. But that was just in the interactive prompt. It definitely looks like a beefier example to test with is warranted. – Guvante Feb 07 '12 at 22:10
@Daniel: I'm a bit relieved though adding stopwatch.stop did not change the result in any way (I guess I would have left this post in big and quite shame if this would have been the issue :) ). – TheBigW Feb 08 '12 at 19:15

score 5 · Answer 2 · answered Feb 07 '12 at 21:18

5

I think that it's safe to say that idiomatic F# will often fail to match the performance of optimized C++ for array manipulation, for a few reasons:

Array accesses are checked against the bounds of the array in .NET to ensure memory safety. The CLR's JIT compiler is able to elide bounds checks for some stereotypical code, but this will typically require you to use a for loop with explicit bounds rather than more idiomatic F# constructs.
There is typically a slight amount of overhead to using abstractions like lambdas (e.g. fun i -> ...). In tight loops, this small overhead can end up being significant compared to the work that's being done in the loop body.
As far as I am aware, the CLR JIT compiler doesn't take advantage of SSE instructions to the same degree that C++ compilers are able to.

On the other side of the ledger,

You will never have a buffer overflow in F# code.
Your code will be easier to reason about. As a corollary, for a given level of total code complexity, you can often implement a more complex algorithm in F# than you can in C++.
If necessary you can write unidiomatic code which will run at closer to the speed of C++, and there are also facilities for interoperating with unsafe C++ code if you find that you need to write C++ to meet your performance requirements.

answered Feb 07 '12 at 21:18

kvb

54,864
2
91
133

1. Except that you have this loop in Array.map 2. Which can be avoided by inlining 3. On the other hand parallelization is much easier in functional languages (as in adding single additional function call) and vector operations can be technically added in similar way. – Maciej Piechotka Feb 07 '12 at 21:25
I totally agree with all points above and I'm absolutelly willing to sacrifice some performance (which was my motivation to start playing with F#) but the measured results above show a factor of 5 for a simple operation (thats why I was guessign that I did something wrong). Coming to parallelization: if I start with a factor of 5 I would keep 5 cores busy and than would still probably just be where I was before with one core in C++... – TheBigW Feb 07 '12 at 21:31
@Maciej - For code that needs to write to the array, none of the higher order functions in the `Array` module is equivalent to a for loop (that is, in `arr |> Array.iteri (fun i v -> arr.[i] <- v + 1)` the check for reading will be elided, but there will still be a check for writing at each iteration). – kvb Feb 07 '12 at 21:36
@Maciej - Regarding inlining, at a theoretical level it's not clear that there is a tractable general way of determining when inlining will be profitable - sometimes the code bloat actually degrades performance. At a practical level, you can easily observe that the F# compiler does not always (ever?) inline arguments to `Array.iteri` and friends. – kvb Feb 07 '12 at 21:40
@TheBigW - Part of the problem is that your problem is so simple. This means that any per-iteration overhead will appear to be a big performance hit, while an algorithm that does more each time through the loop would not look so bad. – kvb Feb 07 '12 at 21:40
I will give that a try. I planned to rewrite my simple C++ sample (2D image convolution) and post the results. Anyway I guess this will be even worse -> 3 nested loops (x-dimension, y-dimension and iterating the filter matrix).... – TheBigW Feb 07 '12 at 21:50
2

@TheBigW: I like F# a lot and would rarely recommend against using it, but it _is_ possible it isn't the right tool for _every_ job. – Daniel Feb 07 '12 at 21:53
You are right. I have never seen any tool perfect for every job. Same is true for almost every programing Language (including C++). For me F# was closing the gap for functionality, which I was missing in the .Net world for generic algorithms. On the ther hand I'm tired of C++ as a frontend language for UIs - problem is that C++ algorithms and .NET frontend don't fit so well together - allways some manual written glue code needs to be written and maintained... I still will continue to learn F#. I already like it. – TheBigW Feb 08 '12 at 19:37

F# code optimization or is it really that slow?

2 Answers2