4

I was working on network sort (for arrays smaller than 8) and noticed that all the algorithms focus on its ability to allow parallel operations. Here is one such set for an array of size 5.

 #define SWAP(x,y) if (data[y] < data[x]) { int tmp = data[x]; data[x] = data[y]; data[y] = tmp; }

    //Parallelizable
    SWAP(1, 2);
    SWAP(4, 5);

    //Parallelizable
    SWAP(0, 2);
    SWAP(3, 5);

    //Parallelizable
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(2, 5);

    //Parallelizable
    SWAP(0, 3);
    SWAP(1, 4);

    //Parallelizable
    SWAP(2, 4);
    SWAP(1, 3);

    //Parallelizable
    SWAP(2, 3);

I was working with long int arrays (So each element is 8 bytes in size). So is there any easy way to parallelize these operations in C ? Is there any hardware specific commands I can use to achieve this (SIMD, ASM(x86) etc.)

Paul R
  • 208,748
  • 37
  • 389
  • 560
chettyharish
  • 1,704
  • 7
  • 27
  • 41
  • How many arrays do you have? – this Jul 12 '15 at 21:54
  • Its one large array with a lot of elements (1 billion~). I use an offset in the SWAP which I am using. It will be something like SWAP(1, 2 , lo); where lo is the offset in the array. – chettyharish Jul 12 '15 at 21:57
  • Well you said you are sorting for sizes smaller than 8. So what are you sorting, the entire array or just parts of it? – this Jul 12 '15 at 21:58
  • I am using a form of parallelized merge sort and when the array size is <8 , then I switch to network sort. – chettyharish Jul 12 '15 at 21:59
  • Ok, so you have a lot of small arrays to be sorted. There is no need to parallelize your code as you have shown. Just distribute small arrays to be sorted individually. – this Jul 12 '15 at 22:00
  • Yeah I believe it will be somewhere near (2^30/2^3) small arrays – chettyharish Jul 12 '15 at 22:01
  • I think the code you have is in fact the easiest way to do this in C, it is just that the compiler has to do the optimization. The intel compiler can be told which chip set to compile for, however I don't think it has been released for the latest chip yet. Using hand tuned assembly for the latest xeon chip might get you a teensy improvement, but also might never get any advantage because the operation is memory bound. – fluffybunny Jul 13 '15 at 00:55

1 Answers1

2

As explained by this answer to a question about sorting small collections, you can actually make your swap code more performant by changing its definition to the following one:

#define SWAP(x, y) {                        \
    int dx = data[x];                       \
    data[x] = dx < data[y] ? dx : data[y];  \
    data[y] ^= dx ^ data[x];                \
}

According to the research paper Applying Sorting Networks to Synthesize Optimized Sorting Libraries, this version of SWAP is branchless and compiles down to a mere 5 instructions on GCC or Clang with a decent optimization level. The article also hints at the fact that the low number of instructions may actually make the code benefit from instruction-level parallelism.

If xor does not work for the types to be sort, you can use an alternative version of SWAP that uses two conditionals instead of one, which should be almost as fast as the xor version. Actually, I use this trick in a sorting library of mine and sorting a small fixed-size collection of integers with sorting networks went from « not really better than insertion sort » to « several times faster than insertion sort » when I introduced the trick. Sorting a collection of 8 integers is ~5 times faster with sorting networks than with an insertion sort on my computer.

Community
  • 1
  • 1
Morwenn
  • 21,684
  • 12
  • 93
  • 152