
I am implementing an encryption algorithm and was wondering if there is a more efficient way than O(n) to XOR two unsigned char arrays. I essentially want to know if it's possible to avoid doing something like this:

    unsigned char a1[64];
    unsigned char a2[64];
    unsigned char result[64];

    for (int i = 0; i < 64; i++)
    {
        result[i] = a1[i] ^ a2[i];
    }

To be clear, I know there's nothing inherently wrong with doing this, but I was wondering if there is possibly a more efficient method. I need to keep my algorithm as streamlined as possible.

Thanks in advance for any information!

GARcher
  • you cannot do something with N array elements in less than `O(N)` (unless there is a way to not touch some of the elements because their value can be inferred from other elements) – 463035818_is_not_an_ai Jul 19 '22 at 11:56
  • Possibly you could work with elements bigger than `unsigned char`, but that would still be O(N), which obviously can't be beaten (a sketch combining this with loop unrolling follows these comments). – john Jul 19 '22 at 11:58
  • For AVX512, it is O(1) core-wise using a 512-bit operation, for up to 64 char elements. But with just XOR, memory is the performance limiter. – huseyin tugrul buyukisik Jul 19 '22 at 11:59
  • AVX512 is a good idea (details in my answer below), but it will still be O(n). With 64 bytes, n == 1. – Pepijn Kramer Jul 19 '22 at 12:01
  • If offloaded to GPU, it will be O(1) GPU-wise until N=~10k. But data copying is still O(N) (and much slower than CPU accessing RAM). – huseyin tugrul buyukisik Jul 19 '22 at 12:01
  • @huseyintugrulbuyukisik that's cheating. You can always put a limit on `N` and then any algorithm will be `O(1)`. Anyhow, if OP is interested in arrays of size 64, then big-O complexity is rather irrelevant. – 463035818_is_not_an_ai Jul 19 '22 at 12:06
  • @463035818_is_not_a_number This makes sense, I suppose the overhead really is quite negligible anyway. – GARcher Jul 19 '22 at 12:20
  • Have you tried loop unrolling? Performing more data operations inside the loop reduces the number of compares and increments. – Thomas Matthews Jul 19 '22 at 18:00
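
A minimal sketch combining the two suggestions above (wider elements plus loop unrolling), assuming 64-byte arrays as in the question. `xor64` is a name made up for illustration, and `std::memcpy` is used so the unaligned 8-byte loads and stores stay well-defined:

    #include <cstdint>
    #include <cstring>

    void xor64(const unsigned char a1[64], const unsigned char a2[64],
               unsigned char result[64])
    {
        // 8 bytes per iteration: 8 wide XORs instead of 64 byte-sized
        // ones, and an eighth of the compares and increments.
        for (int i = 0; i < 64; i += 8)
        {
            std::uint64_t x, y;
            std::memcpy(&x, a1 + i, 8);   // well-defined unaligned load
            std::memcpy(&y, a2 + i, 8);
            x ^= y;
            std::memcpy(result + i, &x, 8);
        }
    }

This is still O(n), of course; only the constant factor shrinks.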

3 Answers


As the comments note, the only way of doing this faster than O(n) is not doing it for all elements. In fact, don't do it for any elements!

The reason is that you're writing a cryptographic algorithm. You'll use result[i] a few lines lower. That part will likely be numerically expensive, while this XOR is limited by memory bandwidth. If you replace result[i] with a1[i] ^ a2[i] in the cryptographic operation, the CPU is likely to overlap the memory access and the computation.
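
A minimal sketch of that idea; `round_function` here is a hypothetical placeholder for whatever numerically expensive step would otherwise consume `result`:

    #include <cstdint>

    // Hypothetical placeholder for the expensive cryptographic step.
    inline std::uint8_t round_function(std::uint8_t x)
    {
        return static_cast<std::uint8_t>(x * 0x9du + 1u);  // dummy math
    }

    void encrypt(const unsigned char a1[64], const unsigned char a2[64],
                 unsigned char out[64])
    {
        for (int i = 0; i < 64; i++)
        {
            // Fuse the XOR into the consuming computation instead of
            // materializing a separate result[] array first; the loads
            // can then overlap with the arithmetic in the pipeline.
            out[i] = round_function(a1[i] ^ a2[i]);
        }
    }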

MSalters

With AVX512 it will look like this:

//unsigned char a1[64];
//unsigned char a2[64];
//unsigned char result[64];

#include <cstdint>
#include <immintrin.h>
#include <iostream>

// Type-pun 64 bytes onto one 512-bit integer vector register.
// (Reading the inactive union member is a widely supported compiler
// extension, though not strictly sanctioned by the C++ standard.)
union int512
{
    std::uint8_t bytes[64]{};
    __m512i value;
};

std::ostream& operator<<(std::ostream& os, const int512& value)
{
    bool comma = false;
    for (const auto& byte : value.bytes)
    {
        if (comma) os << ", ";
        os << std::hex << "0x" << static_cast<int>(byte);
        comma = true;
    }
    return os;
}

int main()
{
    int512 data1;
    int512 data2;

    std::uint8_t n{ 0u };
    for (auto& byte : data1.bytes) byte = n++;
    for (auto& byte : data2.bytes) byte = 0xff;

    int512 result;
    // One 512-bit integer XOR (AVX512F) handles all 64 bytes at once.
    result.value = _mm512_xor_si512(data1.value, data2.value);

    std::cout << "data1 = " << data1 << "\n";
    std::cout << "data2 = " << data2 << "\n";
    std::cout << "xor   = " << result << "\n";

    return 0;
}
Pepijn Kramer
  • you don't even need to do that. All modern compilers can auto-vectorize those trivial loops: https://godbolt.org/z/q3cPo8hrh (see the sketch after these comments) – phuclv Jul 19 '22 at 12:30
  • Yes, compilers are good at that. And in more complex situations they can also optimize for the pipelining of the CPU (this was just a good excuse for me to try an intrinsic manually for once). – Pepijn Kramer Jul 19 '22 at 12:31
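
For reference, this is roughly the loop shape that comment refers to. Compiled with optimizations enabled (the flags below are only an example; pick a `-march` matching the target CPU), compilers emit vector XOR instructions for it:

    // e.g. g++ -O3 -march=skylake-avx512 xor.cpp
    void xor_bytes(const unsigned char* a1, const unsigned char* a2,
                   unsigned char* result)
    {
        // A fixed trip count and a simple element-wise body make this
        // loop an easy auto-vectorization target.
        for (int i = 0; i < 64; i++)
            result[i] = a1[i] ^ a2[i];
    }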

And maybe, with luck, the good old std::valarray will also be very fast.

#include <valarray>

// Whole-array element-wise XOR in a single expression.
std::valarray<unsigned char> a1(64);
std::valarray<unsigned char> a2(64);
std::valarray<unsigned char> result(64);

int main() {
    result = a1 ^ a2;
}

Please read the std::valarray documentation for details.

But I think that a good optimizing compiler will outperform all our manual and handwritten optimization attempts...

A M
  • no, you should never use `std::valarray` because it's very slow: [Why don't std::valarray get more attention from all C++ compilers?](https://stackoverflow.com/q/71951442/995714), [Why is valarray so slow?](https://stackoverflow.com/q/6850807/995714), [Why is valarray so slow on Visual Studio 2015?](https://stackoverflow.com/q/56050322/995714). Compilers optimize a normal array or vector better and vectorize it, but not a valarray (see the sketch below). – phuclv Jul 19 '22 at 13:08
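
For comparison, here is a minimal `std::vector` version of the same operation, the kind of code the linked posts report compilers vectorize readily; `std::transform` is just one idiomatic way to spell the loop:

    #include <algorithm>
    #include <functional>
    #include <vector>

    int main()
    {
        std::vector<unsigned char> a1(64), a2(64), result(64);
        // Element-wise XOR over contiguous storage; compilers
        // vectorize this pattern well.
        std::transform(a1.begin(), a1.end(), a2.begin(), result.begin(),
                       std::bit_xor<unsigned char>{});
    }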