7

I am writing an application that uses the Kinect photo data and compares two frames against each other in F#. I am using Rob Miles Learn the Kinect Api page 74 as a guidepost but I am not using pointers and the performance is suffering. The byte from the Kinect frame is 1,228,800 bytes. I wrote the comparison like this:

member this.differenceCount(currentImageBytes) =
    if  previousImageBytes |> Seq.length = 0 then
        previousImageBytes <- currentImageBytes
        0
    else
        let bytes = Seq.zip previousImageBytes currentImageBytes
        let differenceCount = bytes |> Seq.mapi(fun i e -> i, e)
                                    |> Seq.filter(fun (i,e) -> i % 4 <> 0 )
                                    |> Seq.map snd
                                    |> Seq.filter(fun (p,c) -> p <> c)
                                    |> Seq.length
        previousImageBytes <- currentImageBytes
        differenceCount

When I run it, the screen lags because (I think) it is taking too long to process the arrays. In addition, the error rate is approaching 50%.

1) Am I approaching the problem wrong? 2) Is there a way to optimize my code to speed things up?

Jamie Dixon
  • 4,204
  • 4
  • 25
  • 47
  • 2
    hi - this problem is not F# specific (see https://stackoverflow.com/questions/43289/comparing-two-byte-arrays-in-net) - indeed you can *translate* all of the answers ther e - but if you really need speed (and are on Windwos) than grap you the PInvoke answer from there (you can put it into a C# dll if you want) and use it - you'll never get something as fast inside .net – Random Dev Oct 28 '14 at 11:56
  • I think that is the right track. I have the PInvoke working with memcmp but it is returning 0, 1 or -1. I need the total number of differences. Is there a native method that does that? – Jamie Dixon Oct 28 '14 at 12:19
  • oh - I guess I have missed this - I don't know a native version for this no - it's a wild guess but I would bet that the best way to do something like this is to move it to GPU shaders (or write a low-level C/C++/Asm implementation) – Random Dev Oct 28 '14 at 12:23
  • :-( I am not sure what a GPU shader is. I'll see if someone has written a C/C++ implementation that I can use like this: http://stackoverflow.com/questions/15313646/fast-counting-the-number-of-equal-bytes-between-two-arrays – Jamie Dixon Oct 28 '14 at 12:25
  • A shader is a simple programm that runs on one of the shader units in a GPU - and as a GPU has very many of those you can get massive parallel processing power, and of course this is used to grank out your newest game - but I meant it as a kind of half-joke - right now I would look for a native implementation or go with what John proposed – Random Dev Oct 28 '14 at 12:33
  • John's answer is too slow so I am looking for a native implementation. I can also use the unsafe C# code in the book (which I know works) also – Jamie Dixon Oct 28 '14 at 12:54
  • You could investigate writing an SSE2 algorithm in C++/CLI and then consuming that from F# – Steve Oct 28 '14 at 17:32
  • Using a loop will avoid all of the extra allocations. The GC pressure by this will likely be a huge culprit... – Reed Copsey Oct 28 '14 at 18:38
  • For further information re. GPU processing in F# (as mentioned by @Carsten), see the [F# Foundation website](http://fsharp.org/use/gpu/): "GPU execution is a technique for high-performance financial, image processing and other data-parallel numerical programming." – Marc Sigrist Oct 28 '14 at 19:16

3 Answers3

10

Your sequence mapping/filtering via tuples causes a lot of boxing overhead. The following example avoids the boxing and works in parallel, which (on my machine) is 50 times faster.

let parallelRanges =
    let kinectSize = 1228800
    let subSize = kinectSize / Environment.ProcessorCount
    let iMax = kinectSize - 1
    let steps = [| -1 .. subSize .. iMax |]
    steps.[steps.Length - 1] <- iMax
    steps |> Seq.pairwise |> Seq.toArray |> Array.map (fun (x, y) -> x + 1, y)

let countDiffs (prevBytes:byte[]) (curBytes:_[]) =
    let count (fromI, toI) =
        let rec aux i acc =
            if i > toI then acc
            elif i % 4 <> 0 && prevBytes.[i] <> curBytes.[i] then 
                aux (i + 1) (acc + 1)
            else aux (i + 1) acc
        aux fromI 0

    parallelRanges
    |> Array.Parallel.map count
    |> Array.sum
Marc Sigrist
  • 3,964
  • 3
  • 22
  • 23
  • Curious - how much faster is this without running in parallel? I'd expect it to still be quicker than the previous one. – Reed Copsey Oct 28 '14 at 18:52
  • @Reed: When using only one core, it's still ca. 20x faster than the original. Using four cores, it's about 50x faster. – Marc Sigrist Oct 28 '14 at 19:00
3

Without some test data to profile and compare I would do something like

let len = Array.length previousImageBytes
let mutable count = 0
for i in 0 .. 4 .. (len-1) do
    if previousImageBytes.[i] <> currentImageBytes.[i] then
        count <- count+1

which has one pass over the data rather than 5 and avoids the seq functions which are slow.

John Palmer
  • 25,356
  • 3
  • 48
  • 67
2

I have experimented with several different implementations of byte array compare in F# for a project I've been working on.

Broadly speaking, for this particular problem, I found that "more idiomatic F#" == "slower". So I ended up with this:

// this code is very non-F#-ish.  but it's much faster than the
// idiomatic version which preceded it.

let Compare (x:byte[]) (y:byte[]) =
    let xlen = x.Length
    let ylen = y.Length
    let len = if xlen<ylen then xlen else ylen
    let mutable i = 0
    let mutable result = 0
    while i<len do
        let c = (int (x.[i])) - int (y.[i])
        if c <> 0 then
            i <- len+1 // breaks out of the loop, and signals that result is valid
            result <- c
        else
            i <- i + 1
    if i>len then result else (xlen - ylen)

I've violated several guidelines of good functional programming. I've used if-then-else instead of match. I've got mutables. And I faked a break statement with a cheesy change to a loop index variable.

Nonetheless, this ended up being dramatically faster than anything more idiomatic. Even something as simple as a closure with tail recursion was slower. (The profiler indicates that a closure involves a memory allocation on the heap.)

The project containing this code is on GitHub:

https://github.com/ericsink/LSM

The commit history of fs/lsm.fs should contain a record of my other failed attempts.

Eric Sink
  • 342
  • 1
  • 7