0

Here is my question:

Bits twiddling hack

I need to do that very efficiently (I will need to do this operation several billion times on supercomputers) in C or C++11. N and n are known at compile-time (template parameters). What is the most efficient algorithm to do that ?

Here is an example:

#include <iostream>
#include <climits>
#include <type_traits>
#include <bitset>

template <unsigned int Modulo,
          typename Type,
          unsigned int Size = sizeof(Type)*CHAR_BIT,
          class = typename std::enable_if<std::is_integral<Type>::value
                                       && std::is_unsigned<Type>::value>::type>
inline Type f(Type x)
{
    // The most inefficient algorithm ever
    std::bitset<Size> bx(x);
    std::bitset<Size> by(0);
    unsigned int j = 0;
    for (unsigned int i = 0; i < Size; ++i) {
        if (i%Modulo) {
            by[j++] = bx[i];
        }
    }
    return by.to_ullong();
}

int main()
{
    std::bitset<64> x = 823934823;
    std::cout<<x<<std::endl;
    std::cout<<(std::bitset<64>(f<2>(x.to_ullong())))<<std::endl;
    return 0;
}
Vincent
  • 57,703
  • 61
  • 205
  • 388
  • 1
    Do those supercomputers have `pext` (or an equivalent instruction)? This operation is pretty painful with only general purpose bitwise instructions. – harold Jan 15 '14 at 15:48
  • 2
    This question is off-topic because where is the minimal understanding thing thing thing? Anyway, off-topic. Also, give me back my minimal understanding requirement. Endrant. – ScarletAmaranth Jan 15 '14 at 15:48
  • @ScarletAmaranth As I read [**it**](http://meta.stackexchange.com/questions/211080/improving-demonstrate-a-minimal-understanding-close-reason/215546#215546), you should now use the "unclear" reason (yeah, weird, but anyway). – Bernhard Barker Jan 15 '14 at 15:50
  • @Dukeling I want my minimal understanding thing back! :) – ScarletAmaranth Jan 15 '14 at 15:51
  • 1
    @harold : No I only have standard bitwise instructions (my code need to runs on every platform with a standard-compliant C/C++ compiler) – Vincent Jan 15 '14 at 15:53
  • 1
    Well damn. Anyway, put that "permutation" in here to get instant code: http://programming.sirrida.de/calcperm.php – harold Jan 15 '14 at 15:54
  • 7
    I have a non-super-computer that I can take with me on the bus that can do a billion operations in a *quarter of a second*. Are you sure you actually need the **most efficient algorithm in the world** to do this only a billion times on a supercomputer? – Eric Lippert Jan 15 '14 at 15:55
  • 4
    The answer to your question is almost certainly: build a lookup table. If `N` is 8 and `n` is between 1 and 7 then there are less than a thousand possibilities. Enumerate them all, put them in a lookup table, and you're done. – Eric Lippert Jan 15 '14 at 15:57
  • 6
    I downvoted because you're asking us to do things for you whilst you have shown zero effort whatsoever. What have you tried? Why was it slow? How did you benchmark? What were your results and what leads you to believe it can be improved on? – ScarletAmaranth Jan 15 '14 at 15:58
  • 4
    Just because you include a pretty image explaining what your problem is doesn't mean you have made an attempt to actually solve it (and got stuck ipso facto). – ScarletAmaranth Jan 15 '14 at 16:01
  • 1
    @ScarletAmaranth +1 for using the word "whilst". – Fiddling Bits Jan 15 '14 at 16:01
  • Some architectures have this in hardware,very efficient. – PlasmaHH Jan 15 '14 at 16:08
  • 1
    Everyone is asking you to post at least a non-efficient version, with some insight into why its not efficient enough. – Glenn Teitelbaum Jan 15 '14 at 16:17
  • @PlasmaHH : what is the name of this instruction and what would be a standard code that a compiler could optimize using this instruction ? – Vincent Jan 15 '14 at 16:37
  • 2
    @vincent: See Intel's info for the `PEXT` instruction (part of BMI2, introduced with Haswell CPUs) – Brendan Jan 15 '14 at 16:57
  • How big will N get? The picture illustrates N = 8; will you need N = 16, 32, 64? How big will n get? For N = 8, n < 8; if N can be larger than 8, how big does n get? – Jonathan Leffler Jan 16 '14 at 02:43

1 Answers1

2

Semantics first...

Semantically (and conceptually, because you can't actually use iterators here), you are doing a std::copy_if where your input and output ranges are a std::bitset<N> and your predicate is a lambda of the form (using C++14 generic lambda notation)

[](auto elem) { return elem % n != 0; }

This algorithm has O(N) complexity in the number of assignments and number of invocations of your predicate. Because std::bitset<N> doesn't have iterators, you have to check bit by bit. This means that your loop with a handwritten predicate is doing the exact same computation as a std::copy_if over a hypothetical iterable std::bitset<N>.

This means that as far as asympotic efficiency is concerned, your algorithm should not be considered as inefficient.

...optimization last

So given the conclusion that your algorithm isn't doing anything as bad as quadratic complexity, can its constant factor be optimized? The main source of efficiency of a std::bitset comes from the fact that your hardware can handle many (8, 16, 32 or 64) bits in parallel. If you had access to the implementation, you could write your own copy_if that takes advantage of that parallelism, e.g. by special hardware instructions, lookup tables, or some bit-twiddling algorithm.

E.g. this is how the member function count(), as well as the gcc and SGI extensions Find_first_() and Find_next_() are implemented. The old SGI implementation uses lookup tables of 256 entries to handle bit count and quasi-iteration over the bits of each 8-bit char. The latest gcc version uses __builtin_popcountll() and __builtin_ctzll() to do population count and bit lookup for each 64-bit word.

Unfortunately, std::bitset does not expose its underlying array of unsigned integers. So if you want to improve your posted algorithm, you need to write your own BitSet class template (possible by adapting the source of your own Standard Library) and give it a member function copy_if (or similar) that takes advantage of your hardware. It can give efficiency gains of a factor of 8 to 64 compared to your current algorithm.

Community
  • 1
  • 1
TemplateRex
  • 69,038
  • 19
  • 164
  • 304