
I need to do something like this in the fastest way possible (O(1) would be perfect):

for (int j = 0; j < V; ++j)
        {
            if(!visited[j]) required[j]=0;
        }

I came up with this solution:

for (int j = 0; j < V; ++j)
        {
             required[j]=visited[j]&required[j];
        }

Which made the program run 3 times faster but I believe there is an even better way to do this. Am I right?

Btw, `required` and `visited` are dynamically allocated arrays:

bool *required;
bool *visited;
required = new bool[V];
visited = new bool[V];
Pawelnr1
    I doubt you could do it in `O(1)`. Also be sure you are compiling with optimizations on. – Colonel Thirty Two Oct 03 '15 at 20:43
    You can do it in O(1) by not doing it at all. Compute values of `required[j]` lazily when they're actually needed. – Oliver Charlesworth Oct 03 '15 at 20:45
  • The "fastest way" has little to do with big O's and whatnot, and more with the low level optimization that you were sort of already doing. Do you want more of that? Perhaps using SIMD intrinsics? – harold Oct 03 '15 at 20:46
  • Do you have to use an array of bool? Something like a `bitset` (or `dynamic_bitset`) would be much faster, because one instruction can handle 32 or 64 booleans at once. – Alan Stokes Oct 03 '15 at 20:53

2 Answers


In the case where you're working with arrays of simple objects, you are most likely best served by the functionality provided by the C++ Standard Library. Structures like `valarray` and `vector` are recognized and optimized very effectively by all modern compilers.

Much debate exists as to how much you can rely on your compiler, but one guarantee is that your compiler was built alongside the standard library, so relying on it for basic functionality (such as your problem) is generally a safe bet.

Never be afraid to run your own timing tests and race your compiler! It's a fun exercise, and one that is increasingly difficult to win.

Construct a `valarray` (highly optimized in C++11 and later):

std::valarray<bool> valRequired(required, V);
std::valarray<bool> valVisited(visited, V);
valRequired &= valVisited;

Alternatively, you could do it with one line using transform:

std::transform(required, required + V, visited, required, [](bool r, bool v){ return r & v; });

Edit: while fewer lines of code does not by itself mean faster code, your compiler will likely vectorize this operation.

I also tested their timing:

int main(int argc, const char * argv[]) {
    auto clock = std::chrono::high_resolution_clock{};
    {
        bool visited[5] = {1,0,1,0,0};
        bool required[5] = {1,1,1,0,1};

        auto start = clock.now();
        for (int i = 0; i < 5; ++i) {
            required[i] &= visited[i];
        }
        auto end = clock.now();
        std::cout << "1: " << (end - start).count() << std::endl;
    }

    {
        bool visited[5] = {1,0,1,0,0};
        bool required[5] = {1,1,1,0,1};

        auto start = clock.now();
        for (int i = 0; i < 5; ++i) {
            required[i] = visited[i] & required[i];
        }
        auto end = clock.now();
        std::cout << "2: " << (end - start).count() << std::endl;
    }

    {
        bool visited[5] = {1,0,1,0,0};
        bool required[5] = {1,1,1,0,1};

        auto start = clock.now();
        std::transform(required, required + 5, visited, required, [](bool r, bool v){ return r & v; });
        auto end = clock.now();
        std::cout << "3: " << (end - start).count() << std::endl;
    }

    {
        bool visited[5] = {1,0,1,0,0};
        bool required[5] = {1,1,1,0,1};
        std::valarray<bool> valVisited(visited, 5);
        std::valarray<bool> valrequired(required, 5);

        auto start = clock.now();
        valrequired &= valVisited;
        auto end = clock.now();
        std::cout << "4: " << (end - start).count() << std::endl;
    }
}

Output:

1: 102
2: 55
3: 47
4: 45
Program ended with exit code: 0
Aidan Gomez
  • @RussSchultz and harold You're both right, it certainly isn't magic. In fact, compilers have very non-magical insight into the standard library through it being a **standard**. Using it provides consistency and a guarantee that the implementation will be one accessible and familiar to your compiler; unlike you trying to optimize your code the way you might have in the 90s. Yes, you're both right, there are cases where one can foresee the compiler taking "the long route" because of an unfamiliar object structure, and one should handle those cases appropriately. – Aidan Gomez Oct 03 '15 at 21:34
  • As @harold pointed out, this is not one of those cases. However, I'll revise and note this in my answer. Let me know if you see any more room for improvement in my answer. – Aidan Gomez Oct 03 '15 at 21:46
    No, the compiler only could have insight into the interface of the STL, not the implementation (because the STL can be replaced, i.e. newlib vs. uclib, vs. vendor supplied). The compiler can't (or shouldn't) make any assumptions as to what's happening inside the calls to the STL, though I know some compilers for embedded products turn strcpy, strlen, memcpy, memset, etc. into inline assembly rather than calls into libc. – Russ Schultz Oct 03 '15 at 21:49
  • That being said, use STL first, then look for optimizations once your code works. Nothing slows a project down worse than getting stuck trying to write the exact optimal solution for every operation. – Russ Schultz Oct 03 '15 at 21:50
  • @RussSchultz Interesting, it was my understanding that compilers provide their own implementation of the STL? After reading a couple pieces I see that it's the ABI that is bundled with them. Thanks for the insight! – Aidan Gomez Oct 03 '15 at 21:53
  • I tested with arrays sized 425 elements with g++ 5.2.0, and method 3 is something like 10 times faster than the rest. Why would you test with only 5 elements? (and testing with 5, my numbers are not very useful... basically all random and unknown which is faster) – Peter Oct 03 '15 at 22:01
  • @AidanGomez Nope, they usually provide a version from the GNU implementation or license from Dinkum for the STL. There's also several sources of libc. Some vendors do provide their own. The implementations can be all over the place when talking about suitability for embedded work (and general performance). – Russ Schultz Oct 03 '15 at 22:04
  • @Peter I assumed V was small. Wasn't stress-testing, feel free to edit. Building for 5 on Clang 700.0.72 had consistent results. Order never changed. – Aidan Gomez Oct 03 '15 at 22:14
  • Do you have optimizations enabled? I find it hard to believe that having a library call a function pointer to do the `&=` would be faster than doing a hard coded `for()` loop, unless you don't have optimizations enabled. – Russ Schultz Oct 03 '15 at 22:30
  • I had maximum optimization enabled, yes. – Aidan Gomez Oct 03 '15 at 22:48
  • See http://cpp.sh/8eea for some interesting insight on the varying performance of the different options. It looks like timing is mildly suspect (first run is always much slower). Method 1&2 are essentially the same (as I would expect), 3 is slower (also expected), and 4 is somewhere in the middle. – Russ Schultz Oct 03 '15 at 23:07
  • @RussSchultz timing wouldn't be the issue, just processor things :) http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Aidan Gomez Oct 05 '15 at 22:32

Along the lines of @AlanStokes' suggestion, use packed binary data and combine it with the AVX-512 intrinsic `_mm512_and_epi64`, 512 bits at a time. Be prepared for your hair to get messed up.