I have more than 1e7 sequences of tokens, where each token can only take one of four possible values.
In order to make this dataset fit into memory, I decided to encode each token in 2 bits, which lets me store 4 tokens per byte instead of just one (as when using a char for each token / a std::string for a sequence). I store each sequence in a char array.
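To make the storage layout concrete, here is a minimal sketch of such a 2-bit packing. `PackedSeq` and its members are hypothetical names used only for illustration, and the sketch assumes that token `i` lives in byte `i / 4` at bit offset `(i % 4) * 2` (lowest bits first), which matches what `getTokenFromBlock` in the code further down expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for illustration: 4 tokens per byte,
// token i lives in byte i / 4 at bit offset (i % 4) * 2.
struct PackedSeq
{
    std::vector<std::uint8_t> bytes;
    std::size_t size = 0;   // number of tokens stored

    // Append one token; only the two lowest bits of `token` are kept.
    void push_back(unsigned token)
    {
        if (size % 4 == 0)
            bytes.push_back(0);   // start a fresh byte every 4 tokens
        bytes.back() |= static_cast<std::uint8_t>((token & 3u) << ((size % 4) * 2));
        ++size;
    }

    // Read token i back as a value in 0..3.
    unsigned operator[](std::size_t i) const
    {
        return (bytes[i / 4] >> ((i % 4) * 2)) & 3u;
    }
};
```

With this layout a sequence of n tokens needs (n + 3) / 4 bytes, a quarter of what one char per token would take.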
For some algorithm, I need to test arbitrary subsequences of two token sequences for exact equality. Each subsequence can have an arbitrary offset. The length is typically between 10 and 30 tokens (random) and is the same for the two subsequences.
My current method is to operate in chunks:
- Copy up to 32 tokens (2 bits each) from each subsequence into a uint64_t. This is done in a loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t.
- Compare the two uint64_t values. If they are not equal, return.
- Repeat until all tokens in the subsequences have been processed.
#include <climits>
#include <cstdint>

using Block = char;
constexpr int BitsPerToken = 2;
constexpr int TokenPerBlock = sizeof(Block) * CHAR_BIT / BitsPerToken;

// Extract token number nt (0 = lowest two bits) from block b.
Block getTokenFromBlock(Block b, int nt) noexcept
{
    return (b >> (nt * BitsPerToken)) & ((1UL << (BitsPerToken)) - 1);
}

bool seqEqual(Block const* seqA, int startA, int endA, Block const* seqB, int startB, int endB) noexcept
{
    using CompareBlock = uint64_t;
    constexpr int TokenPerCompareBlock = sizeof(CompareBlock) * CHAR_BIT / BitsPerToken;

    // Both subsequences are assumed to have the same length.
    const int len = endA - startA;
    int posA = startA;
    int posB = startB;
    CompareBlock curA = 0;
    CompareBlock curB = 0;
    for (int i = 0; i < len; ++i, ++posA, ++posB)
    {
        // Slot of this token inside the 64-bit compare word.
        const int cmpIdx = i % TokenPerCompareBlock;
        const int blockA = posA / TokenPerBlock;
        const int idxA = posA % TokenPerBlock;
        const int blockB = posB / TokenPerBlock;
        const int idxB = posB % TokenPerBlock;
        // A compare word is full: check it and start the next one.
        if ((i % TokenPerCompareBlock) == 0)
        {
            if (curA != curB)
                return false;
            curA = 0;
            curB = 0;
        }
        // Widen to 64 bits before shifting; the shift can be as large as 62.
        curA |= static_cast<CompareBlock>(getTokenFromBlock(seqA[blockA], idxA)) << (BitsPerToken * cmpIdx);
        curB |= static_cast<CompareBlock>(getTokenFromBlock(seqB[blockB], idxB)) << (BitsPerToken * cmpIdx);
    }
    // Compare the last, possibly partial, word.
    if (curA != curB)
        return false;
    return true;
}
I figured that this should be quite fast (comparing 32 tokens simultaneously), but it is more than twice as slow as using a std::string (with each token stored in a char) and its operator==.
I have looked into std::memcmp, but cannot use it because the subsequence might start somewhere within a byte (at a multiple of 2 bits, though).
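Since the misalignment is always a multiple of 2 bits and at most 6 bits, one possible direction is to load a whole 64-bit word starting at the containing byte and shift the unwanted low bits away, rather than assembling the word token by token. The following is only a sketch under several assumptions, not a measured solution: the host is little-endian, the sequences are stored as `unsigned char` rather than `char`, and every buffer has at least 8 readable padding bytes after its last token. `loadTokens`, `seqEqualWords` and `pack` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Load the 8 bytes starting at the byte that holds token `pos`, then shift
// so that token `pos` ends up in the lowest two bits of the result.
// Assumes a little-endian host and 8 readable bytes from that position.
inline std::uint64_t loadTokens(const unsigned char* seq, int pos) noexcept
{
    std::uint64_t w;
    std::memcpy(&w, seq + pos / 4, sizeof w);   // well-defined unaligned load
    return w >> ((pos % 4) * 2);                // drops at most 6 bits
}

// Compare `len` tokens starting at startA / startB, 28 tokens per step.
// After the shift at least 58 bits (29 tokens) of each word are valid,
// so 28 tokens (56 bits) can always be compared safely.
inline bool seqEqualWords(const unsigned char* seqA, int startA,
                          const unsigned char* seqB, int startB, int len) noexcept
{
    assert(len >= 0);
    constexpr int Step = 28;
    while (len > 0)
    {
        const int n = len < Step ? len : Step;
        // Keep only the n tokens under comparison (2 * n <= 56 bits).
        const std::uint64_t mask = (std::uint64_t{1} << (2 * n)) - 1;
        if ((loadTokens(seqA, startA) ^ loadTokens(seqB, startB)) & mask)
            return false;
        startA += n;
        startB += n;
        len -= n;
    }
    return true;
}

// Example packer: 4 tokens per byte, lowest bits first, followed by
// 8 zero bytes of padding so that loadTokens never reads out of bounds.
inline std::vector<unsigned char> pack(const std::vector<int>& tokens)
{
    std::vector<unsigned char> out(tokens.size() / 4 + 1 + 8, 0);
    for (std::size_t i = 0; i < tokens.size(); ++i)
        out[i / 4] |= static_cast<unsigned char>((tokens[i] & 3) << ((i % 4) * 2));
    return out;
}
```

For the typical lengths of 10 to 30 tokens this needs at most two loads per sequence. Whether it actually beats std::string's operator== would have to be measured on the real data.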
Another candidate would be boost::dynamic_bitset, which basically implements the same storage format. However, as far as I can tell, it does not offer equality tests on arbitrary subranges.
How can I achieve fast equality tests using this compressed format?