0

I've identified a function in a tightloop that's responsible for 50% of the time in my program(finding nearest neighbors). It calculates the euclidean distance between two unit vectors. Is there any way to make this run faster? (currently I am using gcc's -march=native and -ffast-math flags)

template<typename T>
  static inline T distance(const T* x, const T* y, int f) {
    T pp = 0, qq = 0, pq = 0;
    for (int z = 0; z < f; z++, x++, y++) {
      pp += (*x) * (*x);
      qq += (*y) * (*y);
      pq += (*x) * (*y);
    }
    T ppqq = pp * qq;
    if (ppqq > 0) return 2.0 - 2.0 * pq / sqrt(ppqq);
    else return 2.0;
  }
sepp2k
  • 363,768
  • 54
  • 674
  • 675
ijklr
  • 145
  • 8
  • is this used for multiple types or T is actually only ever a single type? – Samer Tufail Mar 29 '17 at 15:10
  • Presumably `T` must be `float` or `double`? Can `x` and `y` alias? Is `f` bounded above? – Useless Mar 29 '17 at 15:17
  • How frequent/fast is the call to `sqrt`? – Vittorio Romeo Mar 29 '17 at 15:17
  • you could initialize with the first elements and start the loop at 1 – Beginner Mar 29 '17 at 15:21
  • 1
    "I am using gcc's -march=native and -ffast-math flags" - as well as `-O2` or `-O3` I assume? – Jesper Juhl Mar 29 '17 at 15:31
  • @JesperJuhl yes, `-O3` is used as well. – ijklr Mar 29 '17 at 19:04
  • @SamerTufail yes, they are floating point calculations. f is bounded above. what do you mean by `x` and `x` alias? – ijklr Mar 29 '17 at 19:06
  • @SamerTufail it is actually used for single type, which is `float`. But since this C++ template shouldn't affect the performance like java's generics. – ijklr Mar 29 '17 at 19:07
  • @VittorioRomeo very frequent, but still not as bad as the loop above it. That loop is just doing do-products. – ijklr Mar 29 '17 at 19:09
  • @Beginner i think you're joking, but that is actually funny.. :) – ijklr Mar 29 '17 at 19:10
  • Have you tried building with `-flto`? – Jesper Juhl Mar 29 '17 at 20:53
  • How about adding a template parameter for `int f` and specializing for T=float/double and f=1,2,3,4? You could completely unroll that loop. – Beginner Mar 30 '17 at 07:04
  • 1
    How about using SSE instructions, similar to this question: http://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector – Beginner Mar 30 '17 at 07:29
  • @Beginner I've tried loop unrolling and SSE for calculating dot-products, and with the right compiler flags, there is not much improvement. If you want to see the benchmarks, my code for that is here: https://github.com/ijklr/sse – ijklr Mar 30 '17 at 16:39
  • @JesperJuhl what does that flag do? – ijklr Mar 30 '17 at 16:41
  • @ijklr As the documentation ("man g++") would have told you faster than typing that question, it enables link time optimization. – Jesper Juhl Mar 30 '17 at 16:54
  • @JesperJuhl Thanks, sorry I was using a mac and didn't have `man g++` in the terminal. but I sshed into a linux machine and see it that `-flto` flag doc now. – ijklr Mar 30 '17 at 17:32
  • The correct answer t this question depends on typical values of `z`. is it typically 4, 4000 or 4.000.000 ? – MSalters Apr 03 '17 at 09:44

1 Answers1

1

This code is higly paralelizable, you could try to use OpenMP or a small thread to do the computations in different threads.

template<typename T>
  static inline T distance(const T[] x, const T[] y, int f) {
    T pp = 0, qq = 0, pq = 0;
    #pragma omp parallel for private(x,z) reduction(*:pp)
    for (int z = 0; z < f; z++, x++) {
      pp += (*x) * (*x);
    }
  }

OpenMP is a Paralelization library that relies on Pragmas for paralelizing code, it's really common in scientific computing. You can easily create threads by specifiying a pragma over the statement you wanna paralelize.

The simplest format is

#pragma omp parallel for
for(int i = 0; i < 10000; i++) {
    paralelized code here
}

This will take care of creating the threads, executing the paralelized code and joiining the threads back for you with a simple #pragma definition.

Tomaz Canabrava
  • 2,320
  • 15
  • 20