5

At here, it is defined this function:

        template <typename T, typename = std::enable_if_t<is_uint32_v<T> || is_uint64_v<T>>>
        inline T reverse_bits(T operand, int bit_count)
        {
            // Just return zero if bit_count is zero
            return (bit_count == 0) ? T(0)
                                    : reverse_bits(operand) >> (sizeof(T) * static_cast<std::size_t>(bits_per_byte) -
                                                                static_cast<std::size_t>(bit_count));
        }

At a later point, this function is used to store elements in a scrambled way into an array:

inv_root_powers_[reverse_bits(i - 1, coeff_count_power_) + 1].set(power, modulus_);

The justification for this is so that the memory access is coalesced. However, I don't know why such random values would make it easier for the memory access. For example, here are some values:

reverse_bits(3661, 12) +1 = 2856
reverse_bits(3662, 12) +1 = 1832
reverse_bits(3663, 12) +1 = 3880
reverse_bits(3664, 12) +1 = 168
reverse_bits(3665, 12) +1 = 2216
reverse_bits(3666, 12) +1 = 1192
reverse_bits(3667, 12) +1 = 3240
reverse_bits(3668, 12) +1 = 680
reverse_bits(3669, 12) +1 = 2728

seems like things are stored far apart.

Guerlando OCs
  • 1,886
  • 9
  • 61
  • 150

1 Answers1

3

You're right - the accesses you see in NTTTables::initialize are random-access and not serial. It is slower because of this "scramble". However, most of the work happens only later in DWTHandler::transform_to_rev, when the transform itself is applied.

There, they need to access the roots by reverse-bits order. The array being pre-scrambled means all the accesses to this array are now serial: you can see this in the r = *++roots; lines.

The reverse-bits access pattern has a good, real reason - it's because they're doing a variant of the Finite Fourier Transform (FFT). The memory access patterns used in those algorithms (sometimes called butterflies) are done in a bit-reverse order.

unddoch
  • 5,790
  • 1
  • 24
  • 37