
I’m trying to combine two images with a certain algorithm, but in its current state it is too slow: it takes about 70 ms to combine two 512x512 images. That is acceptable, but the time grows quickly as the images get bigger.

This is the code in C# (based on Fast work with Bitmaps in C#):

var t = new Vec3f(0);
var u = new Vec3f(0);
var r = new Vec3f(0);

for (int i = 0; i < bData1.Height; ++i)
{
    for (int j = 0; j < bData1.Width; ++j)
    {
        byte* dataBase = bData1Scan0Ptr + i * bData1.Stride + j * m_BitsPerPixel / 8;
        byte* dataDetail = bData2Scan0Ptr + i * bData2.Stride + j * m_BitsPerPixel / 8;

        byte* dataCombined = bDataCombinedScan0Ptr + i * bDataCombined.Stride + j * m_BitsPerPixel / 8;

        t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;
        t.y = (dataBase[1] / 255.0f) * 2.0f - 1.0f;
        t.z = (dataBase[0] / 255.0f) * 2.0f;

        u.x = (dataDetail[2] / 255.0f) * -2.0f + 1.0f;
        u.y = (dataDetail[1] / 255.0f) * -2.0f + 1.0f;
        u.z = (dataDetail[0] / 255.0f) * 2.0f - 1.0f;

        r = t * t.Dot(u) - u * t.z;

        r.Normalize();

        //Write data to our new bitmap
        dataCombined[2] = (byte)Math.Round((r.x * 0.5f + 0.5f) * 255.0f);
        dataCombined[1] = (byte)Math.Round((r.y * 0.5f + 0.5f) * 255.0f);
        dataCombined[0] = (byte)Math.Round((r.z * 0.5f + 0.5f) * 255.0f);

        m_VectorImageArray[index, i, j] = t;    //base
        m_VectorImageArray[index + 1, i, j] = u;  //detail
        m_VectorImageArray[index + 2, i, j] = r;  //Combined
    }
}

m_CombinedBitmap.UnlockBits(bDataCombined);

Because I wanted to speed this up, I also tried building a C++ DLL and loading it in my C# project with DllImport. I implemented this SSE vector class (http://fastcpp.blogspot.co.uk/2011/12/simple-vector3-class-with-sse-support.html), thinking it would give a significant speed gain, but unfortunately it turned out to be only ~10 ms faster.

I want to make this faster because I’d like to update the image in real time (looping over the vectors stored in m_VectorImageArray).

The problem isn't the reading/writing of the bitmap so much as the algorithm itself. I don’t think I can use a Parallel.For because the pixels need to end up in the exact same order, or is this possible after all?

VincentC
  • You can parallelize this... you're not working with varying sets of data. Split the image into blocks and process each block on its own thread. Assuming you split them up appropriately, you shouldn't run into any threading issues in terms of the algorithm (see the sketch after these comments). – Simon Whitehead Dec 09 '13 at 00:07
  • Have you profiled it? Do you know where the bottleneck is? – acfrancis Dec 09 '13 at 00:28
  • can you provide a set of images to test and proper code to do the full testing? – Fredou Dec 09 '13 at 00:33
  • @Simon Whitehead Thanks! I’ll have to try this out together with all the other changes! – VincentC Dec 09 '13 at 21:22
  • @acfrancis I’ve profiled it, but because this is the first time I actually need to optimize something I might have misread the information the profiler gave me. [profiler](https://dl.dropboxusercontent.com/u/8801520/Profiler.jpg) I thought this would mean most of the time is spent in my Vec3f class, so I tried to optimize that one. Thanks! – VincentC Dec 09 '13 at 21:22
  • @Fredou I’ve made a test project with images (512x512 and 4096x4096) and uploaded it here: [Solution](https://dl.dropboxusercontent.com/u/8801520/CombineImages.rar) Thanks! – VincentC Dec 09 '13 at 21:23
  • thanks for the code; the answer I provided should be surprising. I will now look at the m_VectorImageArray issue – Fredou Dec 09 '13 at 23:58
  • in fact my solution seems to solve the m_VectorImageArray issue, since you have the needed "information" in the dictionary; you just need to move it out of the method and remove the clear. You could loop through a lot of pictures reusing this cache :-) – Fredou Dec 10 '13 at 00:02
  • That's a good start. Can your profiler narrow it down a bit more to the lines of code (or method call) where the bottleneck is? An educated guess is that it might be the Normalize() method because everything else looks like simple assignments (but they could be properties with significant code in the setter). – acfrancis Dec 10 '13 at 09:41
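A minimal sketch of the row-partitioned parallelization suggested in the comments above, assuming the same bitmap setup as in the question. Each row writes to disjoint memory, so every output pixel still lands in exactly the same place; the one required change is that `t`, `u` and `r` must become per-iteration locals instead of shared variables:

    // using System.Threading.Tasks;
    // Close over plain IntPtr/int values; each iteration derives its own row pointers.
    IntPtr scanBase = bData1.Scan0, scanDetail = bData2.Scan0, scanCombined = bDataCombined.Scan0;
    int strideBase = bData1.Stride, strideDetail = bData2.Stride, strideCombined = bDataCombined.Stride;
    int width = bData1.Width, bytesPerPixel = m_BitsPerPixel / 8;

    Parallel.For(0, bData1.Height, y =>
    {
        unsafe
        {
            byte* rowBase = (byte*)scanBase + y * strideBase;
            byte* rowDetail = (byte*)scanDetail + y * strideDetail;
            byte* rowCombined = (byte*)scanCombined + y * strideCombined;

            var t = new Vec3f(0);   // per-iteration locals: no shared state, no locking
            var u = new Vec3f(0);
            var r = new Vec3f(0);

            for (int x = 0; x < width; ++x)
            {
                // ...per-pixel body from the question, reading rowBase/rowDetail,
                // writing rowCombined and m_VectorImageArray[..., y, x] (disjoint slots)...
                rowBase += bytesPerPixel;
                rowDetail += bytesPerPixel;
                rowCombined += bytesPerPixel;
            }
        }
    });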

4 Answers


I reduced the number of multiplications and divisions performed in every iteration by replacing the per-pixel address arithmetic with running pointers, so it should go a little faster. Not tested.

var t = new Vec3f(0);
var u = new Vec3f(0);
var r = new Vec3f(0);

int xIncr = m_BitsPerPixel / 8;
byte* dataBase = bData1Scan0Ptr;
byte* dataDetail = bData2Scan0Ptr;
byte* nextBase = dataBase + bData1.Stride;
byte* nextDetail = dataDetail + bData2.Stride;

byte* dataCombined = bDataCombinedScan0Ptr;
byte* nextCombined = dataCombined + bDataCombined.Stride;

for (int y = 0; y < bData1.Height; ++y)
{
    for (int x = 0; x < bData1.Width; ++x)
    {
        t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;
        t.y = (dataBase[1] / 255.0f) * 2.0f - 1.0f;
        t.z = (dataBase[0] / 255.0f) * 2.0f;

        u.x = (dataDetail[2] / 255.0f) * -2.0f + 1.0f;
        u.y = (dataDetail[1] / 255.0f) * -2.0f + 1.0f;
        u.z = (dataDetail[0] / 255.0f) * 2.0f - 1.0f;

        r = t * t.Dot(u) - u * t.z;

        r.Normalize();

        //Write data to our new bitmap
        dataCombined[2] = (byte)Math.Round((r.x * 0.5f + 0.5f) * 255.0f);
        dataCombined[1] = (byte)Math.Round((r.y * 0.5f + 0.5f) * 255.0f);
        dataCombined[0] = (byte)Math.Round((r.z * 0.5f + 0.5f) * 255.0f);

        m_VectorImageArray[index, y, x] = t;    //base
        m_VectorImageArray[index + 1, y, x] = u;  //detail
        m_VectorImageArray[index + 2, y, x] = r;  //Combined

        dataBase += xIncr;
        dataDetail += xIncr;
        dataCombined += xIncr;
    }
    dataBase = nextBase;
    nextBase += bData1.Stride;
    dataDetail = nextDetail;
    nextDetail += bData2.Stride;
    dataCombined = nextCombined;
    nextCombined += bDataCombined.Stride;
}

m_CombinedBitmap.UnlockBits(bDataCombined);
Jigsore
  • Thanks for this! I’ve implemented it and it saves ~300ms on a 4096x4096 image. It used to take ~4800ms. – VincentC Dec 09 '13 at 21:23
  • You could also try changing this kind of code `t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;` to something like `t.x = dataBase[2] * (2 / 255.0f) - 1.0f;`. It might help, but I'm not sure. – Jigsore Dec 09 '13 at 21:35
  • Wow! Thanks again! I've changed it and it now only takes ~3600ms. I do however have a small question about it, is this faster because the divide is done at compile time so you get a multiply at runtime? Like Jason S explained [here](http://stackoverflow.com/questions/226465/should-i-use-multiplication-or-division) – VincentC Dec 09 '13 at 21:55
  • Yes, the compiler sees `2 / 255.0f` as a constant because the operation can be performed at compile time (and that's what happens). Thus, you avoid a slow division per pixel. – Jigsore Dec 09 '13 at 22:04

I would suggest removing the many divide-by-255 operations and scaling the math so you remove the multiplies by 255 as well. You could probably convert the whole thing to integer math, too.
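For the float version, a sketch of what folding those constants looks like, reusing the variable names from the question. Note that (r.x * 0.5f + 0.5f) * 255.0f equals r.x * 127.5f + 127.5f, and adding 0.5f before the cast approximates Math.Round for these non-negative values:

    const float Decode = 2.0f / 255.0f;  // folded at compile time: one multiply per channel
    const float Encode = 127.5f;

    t.x = dataBase[2] * Decode - 1.0f;
    t.y = dataBase[1] * Decode - 1.0f;
    t.z = dataBase[0] * Decode;

    // ...blend and normalize as before...

    dataCombined[2] = (byte)(r.x * Encode + Encode + 0.5f);  // ~ Math.Round, without the call
    dataCombined[1] = (byte)(r.y * Encode + Encode + 0.5f);
    dataCombined[0] = (byte)(r.z * Encode + Encode + 0.5f);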

The other thing to look at is your memory access pattern or method calls for m_VectorImageArray -- are they slowing this down? Comment that out to find out. Where is the declaration of that object?

Dithermaster
  • I’ve commented out the m_VectorImageArray and it saved ~500ms! Do you know a way to reduce the time it takes to fill up the array? (~4800ms before). It is declared just above the two for-loops like this: m_VectorImageArray = new Vec3f[(m_ImagesCount - 1) / 10 + 1 + m_ImagesCount, bData1.Width, bData1.Height]; (see the sketch below). I’ll try to remove the /255 and *255 and convert it all to int math too. I did the latter already and, although it didn’t really work out very well with the /255 and *255 still there, it did save ~900ms. Thanks! – VincentC Dec 09 '13 at 21:23
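A note on that array (my suggestion, not from the answer): rank-3 arrays like `Vec3f[,,]` pay for multi-dimensional index arithmetic and bounds checks on every access, so flattening to a one-dimensional array with a manually computed index is often noticeably faster. A sketch, with `imageCount`, `width` and `height` standing in for the dimensions quoted above:

    // One flat plane of width*height Vec3f per image.
    int planeSize = width * height;
    var vectors = new Vec3f[imageCount * planeSize];

    // Inside the pixel loop, replacing m_VectorImageArray[index + k, y, x]:
    int pixel = y * width + x;
    vectors[index * planeSize + pixel] = t;        // base
    vectors[(index + 1) * planeSize + pixel] = u;  // detail
    vectors[(index + 2) * planeSize + pixel] = r;  // combined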

I'm not sure if this makes sense, but what I did is simply create a dictionary of previously calculated values (plus some cleanup...). The main reason: after doing some profiling, 60 to 70% of the CPU time was spent in these two lines:

    r = t * t.Dot(u) - u * t.z;

    r.Normalize();

So here it is:

    private static unsafe void CombineImage(Bitmap image1, Bitmap image2, int index)
    {
        Dictionary<long, int> testDict = new Dictionary<long, int>(); //the magic is with this dictionary

        var combinedBitmap = new Bitmap(image1.Width, image1.Height, image1.PixelFormat);

        BitmapData bData1 = image1.LockBits(new Rectangle(0, 0, image1.Width, image1.Height), ImageLockMode.ReadOnly, image1.PixelFormat);
        BitmapData bData2 = image2.LockBits(new Rectangle(0, 0, image2.Width, image2.Height), ImageLockMode.ReadOnly, image2.PixelFormat);
        BitmapData bDataCombined = combinedBitmap.LockBits(new Rectangle(0, 0, combinedBitmap.Width, combinedBitmap.Height), ImageLockMode.WriteOnly, combinedBitmap.PixelFormat);

        byte* dataBase = (byte*)bData1.Scan0.ToPointer();
        byte* dataDetail = (byte*)bData2.Scan0.ToPointer();
        byte* dataCombined = (byte*)bDataCombined.Scan0.ToPointer();

        const int bitsPerPixel = 24;
        const int xIncr = bitsPerPixel / 8;

        var t = new Vec3f(0);
        var u = new Vec3f(0);
        var r = new Vec3f(0);

        int h = bData1.Height, w = bData1.Width;
        long key;
        int value;

        Stopwatch combineStopwatch = Stopwatch.StartNew();
        for (int y = 0; y < h; ++y)
        {
            for (int x = 0; x < w; ++x)
            {
                //real magic!
                // (long) casts needed: int shifts wrap their count modulo 32, and an int << 24 can go negative and sign-extend
                key = dataBase[0] | (dataBase[1] << 8) | (dataBase[2] << 16) | ((long)dataDetail[0] << 24) | ((long)dataDetail[1] << 32) | ((long)dataDetail[2] << 40);
                if (testDict.TryGetValue(key, out value)) // one lookup instead of ContainsKey + indexer
                {
                    dataCombined[0] = (byte)(value & 255);
                    dataCombined[1] = (byte)((value >> 8) & 255);
                    dataCombined[2] = (byte)((value >> 16) & 255);
                }
                else
                {
                    t.z = (dataBase[0] / 255.0f) * 2.0f;
                    t.y = (dataBase[1] / 255.0f) * 2.0f - 1.0f;
                    t.x = (dataBase[2] / 255.0f) * 2.0f - 1.0f;

                    u.z = (dataDetail[0] / 255.0f) * 2.0f - 1.0f;
                    u.y = (dataDetail[1] / 255.0f) * -2.0f + 1.0f;
                    u.x = (dataDetail[2] / 255.0f) * -2.0f + 1.0f;

                    r = t * t.Dot(u) - u * t.z;

                    r.Normalize();

                    //Write data to our new bitmap
                    dataCombined[0] = (byte)Math.Round((r.z * 0.5f + 0.5f) * 255.0f);
                    dataCombined[1] = (byte)Math.Round((r.y * 0.5f + 0.5f) * 255.0f);
                    dataCombined[2] = (byte)Math.Round((r.x * 0.5f + 0.5f) * 255.0f);

                    value = dataCombined[0] | (dataCombined[1] << 8) | (dataCombined[2] << 16);
                    testDict.Add(key, value);
                }

                dataBase += xIncr;
                dataDetail += xIncr;
                dataCombined += xIncr;
            }
        }
        combineStopwatch.Stop();

        combinedBitmap.UnlockBits(bDataCombined);
        image2.UnlockBits(bData2);
        image1.UnlockBits(bData1);
        //combinedBitmap.Save("helloyou.png", ImageFormat.Png);
        testDict.Clear();

        Console.Write(combineStopwatch.ElapsedMilliseconds + "\n");
    }
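To make the packing concrete: the key packs the six input bytes (three per source pixel) into one 64-bit integer, and the cached value packs the three output bytes into one 32-bit integer. A minimal round-trip illustration (mine, not part of the answer):

    byte b0 = 10, b1 = 200, b2 = 30;

    // Pack three bytes into one int; the 48-bit key above needs a long and
    // explicit (long) casts, because C# int shifts wrap their count modulo 32.
    int packed = b0 | (b1 << 8) | (b2 << 16);

    // Unpack: shift the wanted byte down, then mask everything else off with 255.
    byte r0 = (byte)(packed & 255);          // 10
    byte r1 = (byte)((packed >> 8) & 255);   // 200
    byte r2 = (byte)((packed >> 16) & 255);  // 30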
Fredou
  • depending on both pictures, this could eat a lot of RAM. – Fredou Dec 09 '13 at 23:51
  • I just took 2 random 12-megapixel pictures from my camera and it is not too bad; maximum memory usage doesn't go above 400 MB of RAM. – Fredou Dec 09 '13 at 23:54
  • Thanks a lot for this! I just have a few questions about the “real magic” part :). To see if I completely understand: the ‘key’ consists of all the color values, so if the same colors in both the base and detail bitmaps are seen again it takes the value out of the dictionary? I’m also not sure what (value & 255) does? – VincentC Dec 10 '13 at 22:30
  • @VincentC exactly; the `value & 255` makes sure to return at most one byte's worth of bits so it won't "overflow" when converted. Read up on bitwise operations. – Fredou Dec 10 '13 at 23:55

I have just added another answer to the Stack Overflow post you mentioned in your question: Fast work with Bitmaps.

It tells you how to work directly with the Bitmap data through an Int32 array or a Byte array without copying anything, which should save you quite some time. You can actually save further time by working with an Int32 array instead of bytes, because it takes fewer operations to read and write a whole pixel. All you need is some bit-shifting magic, which you will also find in the post I linked to.
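A minimal sketch of that idea, assuming 32bpp ARGB: GDI+ draws into an `int[]` we pin ourselves, so pixels can be read and written without LockBits or copying (the variable names here are mine, not from the linked post):

    // using System.Drawing; using System.Drawing.Imaging; using System.Runtime.InteropServices;
    int width = 512, height = 512;
    int[] pixels = new int[width * height];
    GCHandle handle = GCHandle.Alloc(pixels, GCHandleType.Pinned);
    var bitmap = new Bitmap(width, height, width * 4,
                            PixelFormat.Format32bppArgb,
                            handle.AddrOfPinnedObject());

    // One int per pixel, laid out 0xAARRGGBB; read and write with shifts and masks.
    int x = 10, y = 20;
    pixels[y * width + x] = (255 << 24) | (200 << 16) | (100 << 8) | 50;  // opaque ARGB
    byte red = (byte)((pixels[y * width + x] >> 16) & 255);               // 200

    // Dispose the bitmap and free the pin when done.
    bitmap.Dispose();
    handle.Free();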

Make sure to do as few type conversions as possible inside the loop as they are quite expensive.

I also agree with Fredou that you should look more carefully at the two lines:

r = t * t.Dot(u) - u * t.z;

r.Normalize();

You could try to unroll the functions to save some time. Create variables outside the loop:

float rx, ry, rz;
float tx, ty, tz;
float ux, uy, uz;
float dot, len;

And then in the loop:

// Dot
dot = tx*ux + ty*uy + tz*uz;
rx = tx * dot - ux*tz;
ry = ty * dot - uy*tz;
rz = tz * dot - uz*tz;

// Normalize
len = (float)Math.Sqrt(rx*rx + ry*ry + rz*rz);
rx /= len;
ry /= len;
rz /= len;

If you really need performance and can afford to lose some accuracy, then replace Math.Sqrt() with the one from this page. It basically says that you can reinterpret the same bits as either int or float by making a struct with LayoutKind.Explicit, like this:

[StructLayout(LayoutKind.Explicit)]
private struct FloatIntUnion
{
    [FieldOffset(0)]
    public float f;

    [FieldOffset(0)]
    public int tmp;
}

Observe that this will not give you the same numeric value in the int and the float, since that would require an actual conversion calculation. It only lets you reuse the same storage bits and treat them as either int or float. You can then save about half the time by calculating the square root like this:

public static float QuickSqrt(float z)
{
    if (z == 0) return 0;
    FloatIntUnion u;
    u.tmp = 0;
    float xhalf = 0.5f * z;
    u.f = z;
    u.tmp = 0x5f375a86 - (u.tmp >> 1);        // magic-constant initial guess for 1/sqrt(z)
    u.f = u.f * (1.5f - xhalf * u.f * u.f);   // one Newton-Raphson refinement step
    return u.f * z;                           // z * (1/sqrt(z)) == sqrt(z)
}

The article mentions that Quake 3 used this method :)
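Since the magic constant actually approximates the reciprocal square root, a further tweak (my variation, the same math minus the final multiply) is to return the reciprocal directly and turn the three divides in the loop into multiplies:

    // Same as QuickSqrt but without the final "* z": returns ~1/sqrt(z).
    public static float QuickInvSqrt(float z)
    {
        FloatIntUnion u;
        u.tmp = 0;
        float xhalf = 0.5f * z;
        u.f = z;
        u.tmp = 0x5f375a86 - (u.tmp >> 1);
        return u.f * (1.5f - xhalf * u.f * u.f);
    }

    // In the loop, multiplication replaces the three divisions:
    len = QuickInvSqrt(rx*rx + ry*ry + rz*rz);
    rx *= len;
    ry *= len;
    rz *= len;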

Daniklad