Need help optimizing speed for a loop/dot-product function in C++

Question

I've identified a function in a tightloop that's responsible for 50% of the time in my program(finding nearest neighbors). It calculates the euclidean distance between two unit vectors. Is there any way to make this run faster? (currently I am using gcc's -march=native and -ffast-math flags)

template<typename T>
  static inline T distance(const T* x, const T* y, int f) {
    T pp = 0, qq = 0, pq = 0;
    for (int z = 0; z < f; z++, x++, y++) {
      pp += (*x) * (*x);
      qq += (*y) * (*y);
      pq += (*x) * (*y);
    }
    T ppqq = pp * qq;
    if (ppqq > 0) return 2.0 - 2.0 * pq / sqrt(ppqq);
    else return 2.0;
  }

is this used for multiple types or T is actually only ever a single type? — Samer Tufail, Mar 29 '17 at 15:10
Presumably `T` must be `float` or `double`? Can `x` and `y` alias? Is `f` bounded above? — Useless, Mar 29 '17 at 15:17
you could initialize with the first elements and start the loop at 1 — Beginner, Mar 29 '17 at 15:21
"I am using gcc's -march=native and -ffast-math flags" - as well as `-O2` or `-O3` I assume? — Jesper Juhl, Mar 29 '17 at 15:31
@SamerTufail yes, they are floating point calculations. f is bounded above. what do you mean by `x` and `x` alias? — ijklr, Mar 29 '17 at 19:06
@SamerTufail it is actually used for single type, which is `float`. But since this C++ template shouldn't affect the performance like java's generics. — ijklr, Mar 29 '17 at 19:07
@VittorioRomeo very frequent, but still not as bad as the loop above it. That loop is just doing do-products. — ijklr, Mar 29 '17 at 19:09
@Beginner i think you're joking, but that is actually funny.. :) — ijklr, Mar 29 '17 at 19:10
How about adding a template parameter for `int f` and specializing for T=float/double and f=1,2,3,4? You could completely unroll that loop. — Beginner, Mar 30 '17 at 07:04
How about using SSE instructions, similar to this question: http://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector — Beginner, Mar 30 '17 at 07:29
@Beginner I've tried loop unrolling and SSE for calculating dot-products, and with the right compiler flags, there is not much improvement. If you want to see the benchmarks, my code for that is here: https://github.com/ijklr/sse — ijklr, Mar 30 '17 at 16:39
@ijklr As the documentation ("man g++") would have told you faster than typing that question, it enables link time optimization. — Jesper Juhl, Mar 30 '17 at 16:54
@JesperJuhl Thanks, sorry I was using a mac and didn't have `man g++` in the terminal. but I sshed into a linux machine and see it that `-flto` flag doc now. — ijklr, Mar 30 '17 at 17:32
The correct answer t this question depends on typical values of `z`. is it typically 4, 4000 or 4.000.000 ? — MSalters, Apr 03 '17 at 09:44

Tomaz Canabrava · Answer 1 · 2017-04-03T09:30:12.697

1

This code is higly paralelizable, you could try to use OpenMP or a small thread to do the computations in different threads.

template<typename T>
  static inline T distance(const T[] x, const T[] y, int f) {
    T pp = 0, qq = 0, pq = 0;
    #pragma omp parallel for private(x,z) reduction(*:pp)
    for (int z = 0; z < f; z++, x++) {
      pp += (*x) * (*x);
    }
  }

OpenMP is a Paralelization library that relies on Pragmas for paralelizing code, it's really common in scientific computing. You can easily create threads by specifiying a pragma over the statement you wanna paralelize.

The simplest format is

#pragma omp parallel for
for(int i = 0; i < 10000; i++) {
    paralelized code here
}

This will take care of creating the threads, executing the paralelized code and joiining the threads back for you with a simple #pragma definition.

edited Apr 03 '17 at 09:30

answered Mar 29 '17 at 15:26

Tomaz Canabrava

2,320
15
20

Would be nice to briefly explain `#pragma omp parallel for private(x,z) reduction(+:pp)` for people unfamiliar with OpenMP – Vittorio Romeo Mar 29 '17 at 15:39
1

Agree with @VittorioRomeo .. I don't know what that means as well. It'd be nice to know. – ijklr Mar 29 '17 at 19:12
edited for more information – Tomaz Canabrava Mar 30 '17 at 06:33
Thanks @TomazCanabrava one more thing, can you explain what `private(x,z) reduction(+:pp)` does? – ijklr Mar 30 '17 at 16:41
private means that the variables are private of teach thread, and reduction says what operator should be used on the reduction, and the variable to store the calculation. – Tomaz Canabrava Apr 03 '17 at 09:29

Need help optimizing speed for a loop/dot-product function in C++

1 Answers1