3

I am writing a C++ algorithm that takes two strings and returns true if you can mutate from string a to string b by changing a single character to another. The two strings must equal in size, and can only have one difference. I also need to have access to the index that changed, and the character of strA that was altered. I found a working algorithm, but it iterates through every single pair of words and is running way too slow on any large amount of input.

bool canChange(std::string const& strA, std::string const& strB, char& letter)
{
    int dif = 0;
    int position = 0;
    int currentSize = (int)strA.size();
    if(currentSize != (int)strB.size())
    {
        return false;
    }
    for(int i = 0; i < currentSize; ++i)
    {
        if(strA[i] != strB[i])
        {
            dif++;
            position = i;
            if(dif > 1)
            {
                return false;
            }
        }
    }
    if(dif == 1)
    {
        letter = strA[position];
        return true;
    }
    else return false;
}

Any advice on optimization?

dave vader
  • 89
  • 1
  • 6
  • what kind of optimization you have in mind? time? memory? by the way you haven't addressed the index of the character that has been changed you are just returning the character itself – Pooya Jan 26 '16 at 00:37
  • 3
    I can see `canChange()` (funny name, at that) checking pairs of characters - can you elaborate on `every single pair of words`? – greybeard Jan 26 '16 at 00:51
  • I think it is the "iterate through every single pair of words" that is the source of the slowness; the posted code looks pretty efficient. If you are comparing every word in text 1 against every word in text 2, that is an O(N^2) operation and it's going to be slow regardless of how much you optimize canChange(). – Jeremy Friesner Jan 26 '16 at 00:54
  • How slow is *way too slow*? What compile options did you use, and what did your micro-benchmark look like? What hardware did you run it on? Intel Haswell? Intel Core2? Micro-benchmarking is *hard* with modern CPUs, thanks to caches, branch prediction, and compilers that don't know when *not* to optimize away a loop that repeats the same work. Compiling without optimization enable is *not* an option, esp. not for normal C++ that requires lots of inlining of wrapper functions to produce acceptable asm. – Peter Cordes Jan 27 '16 at 05:53
  • Your code unfortunately doesn't auto-vectorize with gcc or clang. It compiles to a fairly tight loop that checks one byte per iteration. http://goo.gl/AAEA3A (the loop from `.L4` to `jg .L4`). It has one conditional branch inside the loop that's normally taken. – Peter Cordes Jan 27 '16 at 05:57

3 Answers3

3

It's a bit hard to get away from examining all the characters in the strings, unless you can accept the occasional incorrect result.

I suggest using features of the standard library, and not trying to count the number of mismatches. For example;

#include <string>
#include <algorithm>

bool canChange(std::string const& strA, std::string const& strB, char& letter, std::size_t &index)
{
     bool single_mismatch = false;
     if (strA.size() == strB.size())
     {
         typedef std::string::const_iterator ci; 
         typedef std::pair<ci, ci> mismatch_result;

         ci begA(strA.begin()), endA(strA.end());

         mismatch_result result = std::mismatch(begA, endA, strB.begin());

         if (result.first != endA)    //  found a mismatch
         {
             letter = *(result.first);
             index = std::distance(begA, result.first);

             // now look for a second mismatch

             std::advance(result.first, 1);
             std::advance(result.second, 1);

             single_mismatch = (std::mismatch(result.first, endA, result.second).first == endA);
         }
    }
    return single_mismatch;
}

This works for all versions. It can be simplified a little in C++11.

If the above returns true, then a single mismatch was found.

If the return value is false, then either the strings are different sizes, or the number of mismatches is not equal to 1 (either the strings are equal, or have more than one mismatch).

letter and index are unchanged if the strings are of different lengths or are exactly equal, but otherwise identify the first mismatch (value of the character in strA, and index).

Peter
  • 35,646
  • 4
  • 32
  • 74
2

If you want to optimize for mostly-identical strings, you could use x86 SSE/AVX vector instructions. Your basic idea looks fine: break as soon as you detect a second difference.

To find and count character differences, a sequence like PCMPEQB / PMOVMSKB / test-and-branch is probably good. (Use C/C++ intrinsic functions to get those vector instructions). When your vector loop detects non-zero differences in the current block, POPCNT the bitmask to see if you just found the first difference, or if you found two differences in the same block.

I threw together an untested and not-fully-fleshed out AVX2 version of what I'm describing. This code assumes string lengths are a multiple of 32. Stopping early and handling the last chunk with a cleanup epilogue is left as an exercise for the reader.

#include <immintrin.h>
#include <string>

// not tested, and doesn't avoid reading past the end of the string.
// TODO: epilogue to handle the last up-to-31 left-over bytes separately.
bool canChange_avx2_bmi(std::string const& strA, std::string const& strB, char& letter) {
    size_t size = strA.size();
    if (size != strB.size())
        return false;

    int diffs = 0;
    size_t diffpos = 0;
    size_t pos = 0;
    do {
        uint32_t diffmask = 0;
        while( pos < size ) {
            __m256i vecA  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strA[pos]));
            __m256i vecB  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strB[pos]));
            __m256i vdiff = _mm256_cmpeq_epi8(vecA, vecB);
            diffmask = _mm256_movemask_epi8(vdiff);
            pos += 32;
            if (diffmask) break;  // gcc makes worse code if you include && !diffmask in the while condition, instead of this break
        }
        if (diffmask) {
            diffpos = pos + _tzcnt_u32(diffmask);  // position of the lowest set bit.  Could safely use BSF rather than TZCNT here, since we only run when diffmask is non-zero.
            diffs += _mm_popcnt_u32(diffmask);
        }
    } while(pos < size && diffs <= 1);

    if (diffs == 1) {
        letter = strA[diffpos];
        return true;
    }
    return false;
}

The ugly break instead of including that in the while condition apparently helps gcc generate better code. The do{}while() also matches up with how I want the asm to come out. I didn't try using a for or while loop to see what gcc would do.

The inner loop is really tight this way:

.L14:
        cmp     rcx, r8
        jnb     .L10      #  the while(pos<size) condition
.L6: # entry point for first iteration, because gcc duplicates the pos<size test ahead of the loop

        vmovdqu ymm0, YMMWORD PTR [r9+rcx]        # tmp118,* pos
        vpcmpeqb        ymm0, ymm0, YMMWORD PTR [r10+rcx]       # tmp123, tmp118,* pos
        add     rcx, 32   # pos,
        vpmovmskb       eax, ymm0     # tmp121, tmp123
        test    eax, eax        # tmp121
        je      .L14        #,

In theory, this should run at one iteration per 2 clocks (Intel Haswell). There are 7 fused-domain uops in the loop. (Would be 6, but 2-reg addressing modes apparently can't micro-fuse on SnB-family CPUs.) Since two of the uops are loads, not ALU, this throughput might be achievable on SnB/IvB as well.

This should be exceptionally good for flying over regions where the two strings are identical. The overhead of correctly handling arbitrary string lengths will make this potentially slower than a simple scalar function if strings are short, and/or have multiple differences early on.

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Instead of popcnt, you could just use parity to see if the popcnt is odd or even. But the fastest way to find the parity of a 32-bit integer on x86 (with AVX2 available) is with `popcnt(x) & 1`. An SSE version that doesn't require SSE4 might want to do that. (x86 can do 16-bit parity with `xor al, ah` / `setp al` (PF is set on even parity, IIRC). – Peter Cordes Nov 30 '17 at 03:36
1

How big is your input?

I'd think that the strA[i], strB[i] has function call overhead unless it's inlined. So make sure you do your performance test with inlining turned on and compiled with release. Otherwise, try getting the bytes as a char* with strA.c_str().

If all that fails and it's still not fast enough, try breaking you string into chunks and using memcmp or strncmp on the chunks. If no difference, move to the next chunk until you reach the end or find a difference. If a difference is found, do your trivial byte by byte compare until you find the difference. I suggest this route because memcmp is often faster than your trivial implementations as they can make use of the processor SSE extensions and so forth to do very fast compares.

Also, there is a problem with your code. You're assuming strA is longer than strB and only checking the length of A for the array accessors.

hookenz
  • 36,432
  • 45
  • 177
  • 286