C++ Optimization for string compare and replace

Question

I am trying to optimize this function I have written that compares two strings, then replaces the characters in the first string if they are not found in the second. Do you think, for example converting the string to a vector of chars when changing to uppercase would yield a performance boost? I can't see many ways around the 2 for loops however. Any general tips would be appreciated!

void optimize(std::string & toBeProcessed, const std::string & toBeIgnored, char ch)
{
    std::string upperProcessed = toBeProcessed;
    std::transform(upperProcessed.begin(), upperProcessed.end(), upperProcessed.begin(), ::toupper);
    std::string upperIgnored = toBeIgnored;
    std::transform(upperIgnored.begin(), upperIgnored.end(), upperIgnored.begin(), ::toupper);
    std::vector<char> vectorAfterProcessed;
    bool found;
    for(int i = 0; i <= upperProcessed.size(); i++)
    {
        found = false;
        for(int j = 0; j <= upperIgnored.size(); j++)
        {
            if(upperProcessed[i] == upperIgnored[j])
            {
                vectorAfterProcessed.push_back(upperProcessed[i]);
                found = true;
            }
        }
        if(found != true)
        {
            vectorAfterProcessed.push_back(ch);
        }
    }
    std::string test(vectorAfterProcessed.begin(), vectorAfterProcessed.end());
}

Instead of a second loop you could just use [`std::string::find`](http://www.cplusplus.com/reference/string/string/find/). Not sure how much of a performance boost it is, but it would be more readable. — dwcanillas, May 06 '15 at 19:47
This: `i <= upperProcessed.size()` will go out of bounds. Same with `j` loop. — Galik, May 06 '15 at 19:55
is this a real performance bottleneck in your code? or are you trying to optimize "just because"? — The Paramagnetic Croissant, May 06 '15 at 19:55
One thing -- If you're using a C++ 11 compliant compiler, you might as well pass the strings by value instead of passing by reference and then copying those strings in the function. http://stackoverflow.com/questions/7592630/is-pass-by-value-a-reasonable-default-in-c11 — PaulMcKenzie, May 06 '15 at 20:07
Haha passing by reference is the one thing we can't change, sorry should have mentioned — user3027864, May 06 '15 at 20:09
@user3027864 `passing by reference is the one thing we can't change` Well, given the link, you might reconsider if it's C++11 and that function is called a substantial number of times. You asked for optimization, well... — PaulMcKenzie, May 06 '15 at 20:12
@user3027864 Also, why do you need a vector? Just declare a string and apply `+=` to it. — PaulMcKenzie, May 06 '15 at 20:17
If the expected size of a vector is known then reserve. In your code call `vectorAfterProcessed.reserve(upperProcessed.size() * upperIgnored.size())`. With this, push_backs will be faster — Edwin Rodríguez, May 06 '15 at 20:31

Ishamael · Answer 1 · 2015-05-07T05:47:21.757

Note that char can hold only 256 values. You can just scan ignored string once, and populate a bitmask of characters that occur in it:

uint32_t bitmask[8] = {0};
for(int j = 0; j < upperIgnored.size(); j++)
{
    uint8_t chr = static_cast<uint8_t>(upperIgnored[j]);
    bitmask[chr >> 5] |= (1 << (chr & 31));
}

After that instead of the inner loop just check the value of the bitmask:

for(int i = 0; i < upperProcessed.size(); i++)
{
    uint8_t chr = static_cast<uint8_t>(upperProcessed[i]);
    if(bitmask[chr >> 5] & (1 << (chr & 31)))
    {
        vectorAfterProcessed.push_back(upperProcessed[i]);
    }
    else
    {
        vectorAfterProcessed.push_back(ch);
    }
}

Also note that your code had two other issues: your loops had their right end inclusive, which would very likely lead to a segfault/access violation, and you were not breaking after setting found to true, which would result in character being appended to the processed string multiple times if it occurs multiple times in the ignored string.

Treating it as a bitmask is neat! Now I'm tempted to compare it to the less cache-friendly `bool used[256]`. — , May 07 '15 at 04:50
I would replace the vector with a std::string, and retest to see if the `+=` of a string is more efficient than a `vector::push_back`. — PaulMcKenzie, May 07 '15 at 13:26

score 1 · Answer 2 · edited Feb 18 '16 at 22:50

I was in the middle of arriving similar to Ishmael's solution, only I was tempted to just use the 256-byte boolean array as opposed to Ishamel's more cache-friendly 64-byte bitmask array.

So I was really curious about how these perform against each other and whipped up a quick benchmark.

Benchmark Games

#include <string>
#include <algorithm>
#include <iostream>
#include <cassert>
#include <vector>
#include <ctime>
#include <cctype>

using namespace std;    

static string optimize_original(string& toBeProcessed, const string& toBeIgnored, char ch)
{
    string upperProcessed = toBeProcessed;
    transform(upperProcessed.begin(), upperProcessed.end(), upperProcessed.begin(), ::toupper);
    string upperIgnored = toBeIgnored;
    transform(upperIgnored.begin(), upperIgnored.end(), upperIgnored.begin(), ::toupper);
    vector<char> vectorAfterProcessed;
    bool found;
    for(size_t i = 0; i <= upperProcessed.size(); i++)
    {
        found = false;
        for(size_t j = 0; j <= upperIgnored.size(); j++)
        {
            if(upperProcessed[i] == upperIgnored[j])
            {
                vectorAfterProcessed.push_back(upperProcessed[i]);
                found = true;
            }
        }
        if(found != true)
            vectorAfterProcessed.push_back(ch);
    }
    return string(vectorAfterProcessed.begin(), vectorAfterProcessed.end());
}

static string optimize_paul(string toBeProcessed, string toBeIgnored, char ch)
{
    transform(toBeProcessed.begin(), toBeProcessed.end(), toBeProcessed.begin(), ::toupper);
    transform(toBeIgnored.begin(), toBeIgnored.end(), toBeIgnored.begin(), ::toupper);
    string test;
    size_t start = 0;
    while (start < toBeProcessed.size())
    {
        size_t n = toBeProcessed.find_first_not_of(toBeIgnored, start);
        if ( n != string::npos)
        { 
            toBeProcessed[n] = ch;
            start = n+1;
        }
        else
            break;
    }
    return toBeProcessed;
}

static string optimize_ike(string input, const string& to_keep, char rep)
{
    bool used[256] = {false};
    for (size_t j=0; j < to_keep.size(); ++j)
    {
        used[tolower(to_keep[j])] = true;
        used[toupper(to_keep[j])] = true;
    }
    for (size_t j=0; j < input.size(); ++j)
    {
        if (used[input[j]])
            input[j] = toupper(input[j]);
        else
            input[j] = rep;
    }
    return input;
}

static string optimize_ishmael(string input, const string& to_keep, char rep)
{
    uint32_t bitmask[8] = {0};
    for (size_t j=0; j < to_keep.size(); ++j)
    {
        const uint8_t lower = static_cast<uint8_t>(tolower(to_keep[j]));
        bitmask[lower >> 5] |= (1 << (lower & 31));

        const uint8_t upper = static_cast<uint8_t>(toupper(to_keep[j]));
        bitmask[upper >> 5] |= (1 << (upper & 31));
    }
    for (size_t j=0; j < input.size(); ++j)
    {
        const uint8_t chr = static_cast<uint8_t>(input[j]);
        if (bitmask[chr >> 5] & (1 << (chr & 31)))
            input[j] = toupper(input[j]);
        else
            input[j] = rep;
    }
    return input;
}

static double sys_time()
{
    return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}

enum {string_len = 10000000};
enum {num_tests = 5};

int main()
{
    const string to_keep = "abcd";
    for (int k=0; k < 5; ++k)
    {
        string in;
        for (int j=0; j < string_len; ++j)
            in += rand() % 26 + 'A';

        double time = sys_time();
        volatile const string a = optimize_original(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for original" << endl;

        time = sys_time();
        volatile const string b = optimize_paul(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for Paul's" << endl;

        time = sys_time();
        volatile const string c = optimize_ike(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for Ike's" << endl;

        time = sys_time();
        volatile const string d = optimize_ishmael(in, to_keep, '*');
        cout << ((sys_time() - time) * 1000) << " ms for Ishmael's" << endl;

        cout << endl;
    }
}

Results

515 ms for original
218 ms for Paul's
78 ms for Ike's
63 ms for Ishmael's

514 ms for original
203 ms for Paul's
78 ms for Ike's
73 ms for Ishmael's

515 ms for original
218 ms for Paul's
78 ms for Ike's
63 ms for Ishmael's

515 ms for original
202 ms for Paul's
67 ms for Ike's
62 ms for Ishmael's

515 ms for original
218 ms for Paul's
78 ms for Ike's
62 ms for Ishmael's

Winner -- Ishamel

The winner appears to be Ishmael when it comes to speed, arriving at not only the theoretically fastest solution at O(N+M) [the original is O(N*M)] but also the most micro-efficient.

I believe his solution is clearly superior to mine. I just wanted to offer the benchmarks comparing all of these for posterity.

Paul's solution is perhaps the most elegant from a modern C++ perspective, utilizing what's available and standard to replace the inner loop with higher-level logic. Speed isn't always (or even usually) everything.

score 0 · Answer 3 · answered May 06 '15 at 21:14

Note: Whether this is faster or not, you will have to profile the code and determine this. In addition, the program below has not been tested for all input cases.

If your goal is to search for characters that are not in a string, instead of nested search loops (and vectors), usage of find_first_not_of in a loop would (should) do the job.

#include <string>
#include <algorithm>
#include <iostream>
#include <cctype>

std::string optimize(std::string toBeProcessed, std::string toBeIgnored, char ch)
{
    std::transform(toBeProcessed.begin(), toBeProcessed.end(), toBeProcessed.begin(), ::toupper);
    std::transform(toBeIgnored.begin(), toBeIgnored.end(), toBeIgnored.begin(), ::toupper);
    std::string test;
    size_t start = 0;
    while (start < toBeProcessed.size())
    {
        size_t n = toBeProcessed.find_first_not_of(toBeIgnored, start);
        if ( n != std::string::npos)
        { 
            toBeProcessed[n] = ch;
            start = n+1;
        }
        else
            break;
    }
    return toBeProcessed;
}

int main()
{
    std::string out = optimize("abc123", "abc1", 'x');
    std::cout << out;
}

Live Example: http://ideone.com/RsB37f

This has not been tested for all inputs, but this illustrates the basic point. I am replacing the string in place, as opposed to creating a vector (or even a new string) and building a string from scratch.

In addition, I passed the parameters by value, since it is advantageous to do this with a C++ 11 compliant compiler. Even if you didn't have a C++ 11 compiler, you really don't lose anything by doing this, since in your original example, you were copying the passed-in strings to local variables.

C++ Optimization for string compare and replace

3 Answers3

Benchmark Games

Results

Winner -- Ishamel