If you want to optimize for mostly-identical strings, you could use x86 SSE/AVX vector instructions. Your basic idea looks fine: break as soon as you detect a second difference.
To find and count character differences, a sequence like PCMPEQB
/ PMOVMSKB
/ test-and-branch is probably good. (Use C/C++ intrinsic functions to get those vector instructions). When your vector loop detects non-zero differences in the current block, POPCNT
the bitmask to see if you just found the first difference, or if you found two differences in the same block.
I threw together an untested and not-fully-fleshed out AVX2 version of what I'm describing. This code assumes string lengths are a multiple of 32. Stopping early and handling the last chunk with a cleanup epilogue is left as an exercise for the reader.
#include <immintrin.h>
#include <string>
// not tested, and doesn't avoid reading past the end of the string.
// TODO: epilogue to handle the last up-to-31 left-over bytes separately.
bool canChange_avx2_bmi(std::string const& strA, std::string const& strB, char& letter) {
size_t size = strA.size();
if (size != strB.size())
return false;
int diffs = 0;
size_t diffpos = 0;
size_t pos = 0;
do {
uint32_t diffmask = 0;
while( pos < size ) {
__m256i vecA = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strA[pos]));
__m256i vecB = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strB[pos]));
__m256i vdiff = _mm256_cmpeq_epi8(vecA, vecB);
diffmask = _mm256_movemask_epi8(vdiff);
pos += 32;
if (diffmask) break; // gcc makes worse code if you include && !diffmask in the while condition, instead of this break
}
if (diffmask) {
diffpos = pos + _tzcnt_u32(diffmask); // position of the lowest set bit. Could safely use BSF rather than TZCNT here, since we only run when diffmask is non-zero.
diffs += _mm_popcnt_u32(diffmask);
}
} while(pos < size && diffs <= 1);
if (diffs == 1) {
letter = strA[diffpos];
return true;
}
return false;
}
The ugly break
instead of including that in the while
condition apparently helps gcc generate better code. The do{}while()
also matches up with how I want the asm to come out. I didn't try using a for
or while
loop to see what gcc would do.
The inner loop is really tight this way:
.L14:
cmp rcx, r8
jnb .L10 # the while(pos<size) condition
.L6: # entry point for first iteration, because gcc duplicates the pos<size test ahead of the loop
vmovdqu ymm0, YMMWORD PTR [r9+rcx] # tmp118,* pos
vpcmpeqb ymm0, ymm0, YMMWORD PTR [r10+rcx] # tmp123, tmp118,* pos
add rcx, 32 # pos,
vpmovmskb eax, ymm0 # tmp121, tmp123
test eax, eax # tmp121
je .L14 #,
In theory, this should run at one iteration per 2 clocks (Intel Haswell). There are 7 fused-domain uops in the loop. (Would be 6, but 2-reg addressing modes apparently can't micro-fuse on SnB-family CPUs.) Since two of the uops are loads, not ALU, this throughput might be achievable on SnB/IvB as well.
This should be exceptionally good for flying over regions where the two strings are identical. The overhead of correctly handling arbitrary string lengths will make this potentially slower than a simple scalar function if strings are short, and/or have multiple differences early on.