
I have a struct that represents a vector. This vector consists of two one-byte integers. I use them to keep values from 0 to 255.

typedef unsigned char uint8_T;

struct Vector
{
  uint8_T x;
  uint8_T y;

  Vector( uint8_T x_, uint8_T y_ ) : x( x_ ), y( y_ ) {}
};

Now, the main use case in my program is to multiply both elements of the vector by a 32-bit float value:

typedef float real32_T;

Vector Vector::operator * ( const real32_T f ) const {
  return Vector( (uint8_T)(x * f), (uint8_T)(y * f) );
}

This needs to be performed very often. Is there a way that these two multiplications can be performed simultaneously, maybe by vectorization, SSE or similar? Or does the Visual Studio compiler already do this?

Another use case is to interpolate between two Vectors.

Vector Vector::interpolate(const Vector& rhs, real32_T z) const
{
  return Vector(
        (uint8_T)(x + z * (rhs.x - x)),
        (uint8_T)(y + z * (rhs.y - y))
        );
}

This already uses an optimized interpolation approach (https://stackoverflow.com/a/4353537/871495).

But again, both components of the vector are multiplied by the same scalar value. Is there a possibility to improve the performance of these operations?

Thanks

(I am using Visual Studio 2010 with the 64-bit compiler.)

Gustav-Gans
  • Why don't you compile with optimizations and then profile the code. No sense messing with it if it is not the problem. – NathanOliver Feb 16 '15 at 14:39
  • I'm sure the compiler already does this for you. – BWG Feb 16 '15 at 14:39
  • Ok, thanks. I know that this part slows down my program, but if there is no potential to optimize it, then I have to look at other parts of my code. – Gustav-Gans Feb 16 '15 at 14:48

2 Answers


In my experience, Visual Studio (especially an older version like VS2010) does not do a lot of vectorization on its own. Auto-vectorization has improved in newer versions, so if you can, check whether a newer compiler speeds up your code.

Depending on the code that uses these functions and the optimization the compiler does, it may not even be the calculations that slow down your program. Function calls and cache misses may hurt a lot more.

You could try the following:

  • If not already done, define the functions in the header file, so the compiler can inline them.
  • If you use these functions in a tight loop, try doing the calculations 'by hand' without any function calls (temporarily expose the variables) and see if it makes a speed difference.
  • If you have a lot of vectors, look at how they are laid out in memory. Store them contiguously to minimize cache misses.
  • For SSE to work really well, you'd have to work with 4 values at once - so multiply 2 vectors with 2 floats. In a loop, use a step of 2 and write a static function that calculates 2 vectors at once using SSE instructions (see the sketch after this list). Because your vectors are not aligned (and hardly ever will be with 8-bit variables), the code could even run slower than what you have now, but it's worth a try.
  • If applicable, and if you don't depend on the truncation that occurs with your cast from float to uint8_T (e.g. if your floats are in range [0,1]), try using float everywhere. This may allow the compiler to do far better optimization.
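
To illustrate the SSE point, here is a rough sketch that scales two Vectors by two floats in one go, using SSE2 intrinsics (always available on an x64 target). The helper name mul2 and its calling convention are mine, not something from your code; treat it as a starting point rather than drop-in code:

#include <emmintrin.h> // SSE2 intrinsics

// Scales two vectors by two floats in a single SSE register.
static void mul2( const Vector& a, const Vector& b,
                  real32_T fa, real32_T fb,
                  Vector& outA, Vector& outB )
{
  // Widen the four bytes (a.x, a.y, b.x, b.y) to 32-bit integers.
  __m128i ints = _mm_set_epi32( b.y, b.x, a.y, a.x );

  // Convert to float and multiply by (fa, fa, fb, fb) in one instruction.
  __m128 prod = _mm_mul_ps( _mm_cvtepi32_ps( ints ),
                            _mm_set_ps( fb, fb, fa, fa ) );

  // Truncate back to 32-bit integers, like the C-style cast in operator*.
  __m128i res = _mm_cvttps_epi32( prod );

  // Narrow the four results to uint8_T again.
  outA.x = (uint8_T)_mm_cvtsi128_si32( res );
  outA.y = (uint8_T)_mm_cvtsi128_si32( _mm_srli_si128( res, 4 ) );
  outB.x = (uint8_T)_mm_cvtsi128_si32( _mm_srli_si128( res, 8 ) );
  outB.y = (uint8_T)_mm_cvtsi128_si32( _mm_srli_si128( res, 12 ) );
}

Note that the per-call packing and unpacking (_mm_set_epi32 and the extractions at the end) can easily cost more than the multiplications save; this only pays off if the surrounding loop is restructured so that many vectors are loaded and stored contiguously.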
Daerst
  • Thanks. Most of the answers pointed out that the conversion from int to float may be slow. I didn't use floats because I do not need their precision, and I don't have a problem with the memory consumption. I will try this first, because it will be easy to implement. – Gustav-Gans Feb 16 '15 at 16:40

You haven't shown the full algorithm, but conversions between integer and float values are slow operations. Eliminating these conversions and using only one type (preferably integers, if possible) can greatly improve performance.
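
For example, if the scale factor only needs limited precision, it could be converted to a fixed-point integer once, so the hot loop stays in integer arithmetic. The 8.8 format and the helper names below are my own illustration, not something this answer prescribes:

typedef unsigned short ufix8_8; // 8 integer bits, 8 fractional bits (assumed format)

// Convert the float scale once, outside the hot loop.
inline ufix8_8 to_fixed( real32_T f )
{
  return (ufix8_8)( f * 256.0f );
}

// Pure integer scaling: multiply and shift, no int/float conversion.
inline Vector scale_fixed( const Vector& v, ufix8_8 f )
{
  return Vector( (uint8_T)( ( v.x * f ) >> 8 ),
                 (uint8_T)( ( v.y * f ) >> 8 ) );
}

This quantizes the scale to steps of 1/256, so it only applies if that loss of precision is acceptable for your data.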

Alternatively, you can use lrint() to do the conversion, as explained here.
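
As a sketch of that idea (std::lrint is C99/C++11 and is not available in the VS2010 CRT, so treat this as illustrative; on VS2010 the SSE intrinsic _mm_cvtss_si32 would be a possible substitute, which is my addition, not part of this answer):

#include <cmath> // std::lrint (C99 / C++11)

Vector Vector::operator * ( const real32_T f ) const
{
  // lrint rounds using the current FP rounding mode instead of forcing
  // truncation, which is what traditionally made float-to-int casts slow.
  return Vector( (uint8_T)std::lrint( x * f ),
                 (uint8_T)std::lrint( y * f ) );
}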

BЈовић