0

I was thinking again about implementing the quadratic sieve for fun, which requires Guassian elimination over a binary field, that is the operations required are 1. swapping rows and 2. XORing rows.

My ideas were either to maintain a bit array using a vector of 64-bit ints and bit twiddling, or use vector<bool>, which is probably space-optimized on my system. The bit array must be able to be dynamically sized, so std::bitset won't work. The advantage of maintaining my own ints is that I can XOR 64 bits at a time which is a neat trick. I wanted to see what a compiler would do for a loop that XOR'd bool vectors: (I wasn't able to use ^=, see operator |= on std::vector<bool>)

void xor_vector(std::vector<bool>& a, std::vector<bool>& b) {
    for (std::size_t i=0; i<a.size(); ++i) 
        a[i] = a[i] ^ b[i];
}

I have a very basic understanding of x86 but it looks like the compiler isn't actually XORing words together? Is there a way to get the compiler to XOR entire words at a time?

https://godbolt.org/z/PbGdv3sKT

xor_vector(std::vector<bool, std::allocator<bool> >&, std::vector<bool, std::allocator<bool> >&):
  mov r8, QWORD PTR [rdi]
  mov rax, QWORD PTR [rdi+16]
  mov edx, DWORD PTR [rdi+24]
  sub rax, r8
  lea rdi, [rdx+rax*8]
  test rdi, rdi
  je .L11
  push rbp
  mov r10d, 1
  push rbx
  mov r9, QWORD PTR [rsi]
  xor esi, esi
  jmp .L7
.L16:
  mov rdx, r10
  sal rdx, cl
  mov rcx, QWORD PTR [r11]
  mov rbp, rdx
  test rdx, rcx
  setne bl
  and rbp, QWORD PTR [rax]
  setne bpl
.L4:
  mov rax, rdx
  not rdx
  or rax, rcx
  and rdx, rcx
  cmp bpl, bl
  cmovne rdx, rax
  add rsi, 1
  mov QWORD PTR [r11], rdx
  cmp rsi, rdi
  je .L15
.L7:
  test rsi, rsi
  lea rax, [rsi+63]
  mov rdx, rsi
  cmovns rax, rsi
  sar rdx, 63
  shr rdx, 58
  sar rax, 6
  lea rcx, [rsi+rdx]
  sal rax, 3
  and ecx, 63
  lea r11, [r8+rax]
  add rax, r9
  sub rcx, rdx
  jns .L16
  add rcx, 64
  mov rdx, r10
  sal rdx, cl
  mov rcx, QWORD PTR [r11-8]
  mov rbp, rdx
  test rcx, rdx
  setne bl
  and rbp, QWORD PTR [rax-8]
  setne bpl
  sub r11, 8
  jmp .L4
.L15:
  pop rbx
  pop rbp
  ret
.L11:
  ret

My question is similar to bitwise operations on vector<bool> but the answers are dated and don't seem to answer my question.

Update: I tested with a 256 bit sized bitset too. Still I don't see XORing whole machine words.

void xor_vector(std::bitset<256>& a, std::bitset<256>& b) {
    for (std::size_t i=0; i<a.size(); ++i) 
        a[i] = a[i] ^ b[i];
}

https://godbolt.org/z/jKEf89E1j

xor_vector(std::bitset<256ul>&, std::bitset<256ul>&):
  push rbx
  mov r8, rdi
  mov r11, rsi
  xor edx, edx
  mov ebx, 1
.L4:
  mov rsi, rdx
  mov rcx, rdx
  mov rax, rbx
  shr rsi, 6
  and ecx, 63
  sal rax, cl
  mov rdi, QWORD PTR [r8+rsi*8]
  mov rcx, rax
  and rcx, QWORD PTR [r11+rsi*8]
  mov rcx, rax
  setne r10b
  test rax, rdi
  not rax
  setne r9b
  or rcx, rdi
  and rax, rdi
  cmp r10b, r9b
  cmovne rax, rcx
  add rdx, 1
  mov QWORD PTR [r8+rsi*8], rax
  cmp rdx, 256
  jne .L4
  pop rbx
  ret
qwr
  • 9,525
  • 5
  • 58
  • 102
  • Consider `std::bitset`. – Eugene Oct 27 '21 at 21:57
  • The leading answer to the linked question is to use `std::bitset`, which is not dated in any way. The [cppreference link](https://en.cppreference.com/w/cpp/utility/bitset) still works, too. – Nate Eldredge Oct 27 '21 at 21:57
  • 1
    I thought bitset was not dynamically sized. The program should handle any size vectors. I have updated my question – qwr Oct 27 '21 at 21:59
  • I guess that if you need performance here and compiler doesn't support such optimization, then you have to sacrifice readability and use vector of `std::bitset` or vector of `uint` (keeping the last X bits unused). But an interesting question would be - how possible (and feasible) would be such optimization in a compiler. Recognizing such a simple loop and using underlying memory for operation seems quite doable. – Yksisarvinen Oct 27 '21 at 22:08
  • 1
    `std::vector` is required to be a bit-vector, not an array of `bool`. Unfortunately that means its iterators and indexing are complicated, apparently too complicated for that compiler to see through it and compile it into simple asm that XORs whole chunks (or vectorizes across chunks). https://isocpp.org/blog/2012/11/on-vectorbool - with template specializations by the library, some algorithms over `std::vector` can be great, but the lack of `operator ^=` means there's no standard way to work around the per-bit accessing. – Peter Cordes Oct 27 '21 at 22:09
  • @PeterCordes Oh i've seen and briefly skimmed that article. I'm not a fan of how `vector` is not actually a vector of `bool` datatypes. Confusing naming. – qwr Oct 27 '21 at 22:12
  • clang with `-O3 -stdlib=libc++` doesn't do any better, for x86-64 or ARM. https://godbolt.org/z/vz6qa5hfz – Peter Cordes Oct 27 '21 at 22:13
  • I can hardly even make sense of what the compiler is trying to do here. Lots of weird stuff. Even if it somehow is going by chunks of 64 bits (I don't think so, but I haven't decoded it enough to be sure), then it's doing it so badly that you still may as well do it yourself. – harold Oct 27 '21 at 22:13
  • Yeah, it's widely agreed that `vector` is was terrible way to expose a variable-length bitvector data structure. That's exactly what Howard said in the linked article. – Peter Cordes Oct 27 '21 at 22:14
  • @harold the most straightforward way to xor 64 bits at a time would just be a xor instruction on two 64-bit registers right? – qwr Oct 27 '21 at 22:15
  • @harold: I'm pretty confident it's going 1 bit at a time. Note the `and ecx, 63` to modulo the bit-index. clang for x86-64 and ARM is doing similar things. – Peter Cordes Oct 27 '21 at 22:16
  • 1
    The most straightforward way would be one load (1 uop), one memory-destination xor into the other vector to modify it in-place (2 uops if you avoid an indexed addressing mode). Or if vectorizing then memory-destination isn't available, so it's load / `vpxor ymm0, ymm0, [rdi]` / `vmovdqu [rdi], ymm0`. You should get code like that from a simple loop that XORs in `unsigned long` chunks, with `-O3 -march=native` – Peter Cordes Oct 27 '21 at 22:18
  • @PeterCordes maybe I can ask in a different question, possibly on software engineering stackexchange, if the C++ committee will ever be able to deprecate the current meaning of `vector` and in the future make it a vector of bools. – qwr Oct 27 '21 at 22:21
  • @qwr: They *could* in theory, but the benefit is pretty small so I don't expect it to happen. I guess you mean to ask whether it's plausible that the actual committee members would actually agree on doing that over the course of a decade or something. – Peter Cordes Oct 27 '21 at 22:25
  • 2
    Re: update with std::bitset: the advantage to bitset is that it supports `a ^= b`, which does compile nicely: https://godbolt.org/z/Gs1oPaevs. It's not surprising that a loop over individual bits doesn't get optimized back into whole-chunk accesses; if that could happen for bitset indexing, you'd expect it for vector indexing too. – Peter Cordes Oct 27 '21 at 22:28
  • Since you mention "specialization" in the title, I think it would be necessary to let the compiler see full chunk xor operations, which would mean specializing by adding `std::vector::operator^=` or something. Or possibly specializing [`std::transform`](https://en.cppreference.com/w/cpp/algorithm/transform) for `vector` iterator ranges. But I'm not sure std::transform works over vector-bool iterators normally, so this wouldn't just be an optimization, it would be required for code using it to compile. – Peter Cordes Oct 28 '21 at 00:58
  • Try [boost::dynamic_bitset](https://www.boost.org/doc/libs/1_77_0/libs/dynamic_bitset/dynamic_bitset.html) instead https://godbolt.org/z/jdarnPzoo – phuclv Oct 28 '21 at 02:48
  • @phuclv still don't see the compiler doing xor in chunks – qwr Oct 31 '21 at 08:10

0 Answers0