5

Perhaps I'm overlooking something obvious but I was wondering what the fastest way to implement whole-word string replacement in C++ might be. At first I considered simply concatenating spaces to the search word, but this does not consider the string boundaries or punctuation.

This is my current abstraction for (non-whole-word) replacement:

void Replace(wstring& input, wstring find, wstring replace_with) {
  if (find.empty() || find == replace_with || input.length() < find.length()) {
      return;
  }
  for (size_t pos = input.find(find); 
              pos != wstring::npos; 
              pos = input.find(find, pos)) {

      input.replace(pos, find.length(), replace_with);
      pos += replace_with.length();
  }
}

If I only consider spaces as a word boundary, I could probably implement this by comparing the beginning and end of the search string against the find string to cover the string boundaries, and then following with a Replace(L' ' + find + L' ').... but I was wondering if there was a more elegant solution that would include punctuation efficiently.

Let's consider a word to be any collection of characters that is separated by either whitespace or punctuation (to keep it simple let's say !"#$%&'()*+,-./ at minimum -- which happen to correspond to (c > 31 && c < 48)).

In my application I have to call this function over a rather large array of short strings, which may include various Unicode which I don't want to split new words. I would also like to avoid including any external libraries, but STL is fine.

The purpose of not using regular expressions is the promise of less overhead and the goal of a fast function suited to this particular task over a large dataset.

sakatc
  • 309
  • 1
  • 2
  • 9
  • 1
    Side note: Replace can be very slow if the input is long and you do replacements at the beginning. I would recommend concatenating to a string buffer (for example std::stringstream), then overwriting the input in one step. – Notinlist May 09 '11 at 20:16
  • 1
    The unicode requirement is going to make things a lot trickier. I know you're trying to avoid regexes and adding libraries, but you could look into [ICU](http://site.icu-project.org/) - it has a regular expression-based replace function ([regex docs](http://userguide.icu-project.org/strings/regexp)), and will let you use the \b "word boundary" metacharacter. – Xavier Holt May 09 '11 at 20:38

2 Answers2

3

I think you can do this, both doing whole-word matching and doing it efficiently. The key is to:

  • detect "whole-word" boundaries using 'std::isalpha', which should work with Unicode & any locale.
  • do the replace "out of place" by creating a separate 'output' string that you swap with 'input' at the end of processing, instead of doing the work "in place" on the 'input' string itself.

Here's my take on your function:

#include <cctype> // isalpha
#include <ciso646> // or, not
#include <string> // wstring

using std::size_t;
using std::wstring;

/// @brief Do a "find and replace" on a string.
/// @note This function does "whole-word" matching.
/// @param[in,out] input_string The string to operate on.
/// @param[in] find_string The string to find in the input.
/// @param[in] replace_string The string to replace 'find_string'
///            with in the input.
void find_and_replace( wstring& input_string,
                       const wstring& find_string,
                       const wstring& replace_string )
{
  if( find_string.empty()
      or find_string == replace_string
      or input_string.length() < find_string.length() )
  {
    return;
  }

  wstring output_string;
  output_string.reserve( input_string.length() );
  size_t last_pos = 0u;
  for( size_t new_pos = input_string.find( find_string );
       new_pos != wstring::npos;
       new_pos = input_string.find( find_string, new_pos ) )
  {
    bool did_replace = false;
    if( ( new_pos == 0u
          or not std::isalpha( input_string.at( new_pos - 1u ) ) )
        and ( new_pos + find_string.length() == input_string.length()
              or not std::isalpha( input_string.at( new_pos + find_string.length() ) ) ) )
    {
      output_string.append( input_string, last_pos, new_pos - last_pos );
      output_string.append( replace_string );
      did_replace = true;
    }
    new_pos += find_string.length();
    if( did_replace )
    {
      last_pos = new_pos;
    }
  }
  output_string.append( input_string, last_pos,
                        input_string.length() - last_pos );

  input_string.swap( output_string );
}

P.S. I was unsure what 'replace_all' was trying to accomplish in your initial example, so I removed it from my solution for clarity.

P.P.S. This code would be much cleaner with Regex-es. Can you rely on C++ TR1 or C++ 2011 functionality? They provide a standard 'regex' library.

Charles L Wilcox
  • 1,126
  • 8
  • 18
  • Thinking about this overnight a bit, and seeing @Code_So1dier's response, I should have noted that what defines a "whole-word" in your question is a bit nebulous right now. Is it strictly whitespace, just non alphabetic characters or non alphanumeric characters? That decision would change the logic of the boundary-check done inside the for loop for my example. For example, if just "whole-word" boundaries are white-space, then replace `not std::isalpha( input_string.at( new_pos + find_string.length() ) )` with `std::isspace( input_string.at( new_pos + find_string.length() ) )`. – Charles L Wilcox May 12 '11 at 13:09
  • In my application I probably only need to worry about whitespace and punctuation, and splitting on all unicode is undesirable (edited question). The goal in not using regular expressions is the promise of less overhead and the hopes of a faster function suited to this single task. It's true that a couple \b's is probably enough for most applications. – sakatc May 13 '11 at 05:32
  • @sakatc Okay, so whitespace or punctuation counts is a "whole-word" boundary; modifing my example by replacing `not std::isalpha( input_string.at( new_pos + find_string.length() ) )` with `std::isspace( input_string.at( new_pos + find_string.length() ) ) or std::ispunct( input_string.at( new_pos + find_string.length() ) )` should do the trick. One note, do you want your method to restrict the contents of the argument 'find_string'? For example it'd be confusing if the user submitted "/test ", as it contains both classes of "whole-word" delimiting characters. – Charles L Wilcox May 13 '11 at 12:09
  • I really don't think consideration for word boundaries in the search word should be necessary. If the programmer is trying to match multiple-word strings with a whole word search, then they have essentially asked the function to fail already.... I wouldn't do too much extra work on it. Matching for example adjacent words like "foo bar" is beyond the scope of what I need, but if you want to explore it knock yourself out~ – sakatc May 15 '11 at 07:28
  • also btw, the replace_all from before was meant for recursive replacements and doesn't really apply to whole word replace. For example to reduce redundant newlines from a buffer you could do `Replace(buffer,"\n\n","\n",true);`... I'll remove it from the question since it doesn't apply here. – sakatc May 15 '11 at 07:35
1

This is my quick response, but don't know about how quick the solution is... There are few solutions to this problem:
1. By using iterators, compare each word (delimited by space), recreating string for each occurrence:

string& remove_all_occurences(string& s, const string& str_to_remove, const string& str_to_put){
                typedef string::size_type string_size;
                string_size i = 0;
                string cur_string;
                cur_string.reserve(s.size());

                // invariant: we have processed characters [original value of i, i) 
                while (i != s.size()) {
                // ignore leading blanks
                // invariant: characters in range [original i, current i) are all spaces
                    while (i != s.size() && isspace(s[i]))
                    ++i;

                    // find end of next word
                    string_size j = i;
                    // invariant: none of the characters in range [original j, current j)is a space
                     while (j != s.size() && !isspace(s[j]))
                        j++;
                        // if we found some nonwhitespace characters 


                    if (i != j) {
                        // copy from s starting at the beginning to i, placing str to replace, and finishing with j to the end of s
                        cur_string = s.substr(i,j-i);
                        if(cur_string == str_to_remove){
                            s = s.substr(0,i) + str_to_put + s.substr(j,s.size() - j);
                        }
                        i = j;
                    }
                }
                return s;
            }

Testing the program:

void call_remove_all_occurences(){
                string my_str = "The quick brown fox jumps over sleepy dog fox fox fox";
                cout << remove_all_occurences(my_str,"fox","godzilla") << endl;
            }

Output:

The quick brown godzilla jumps over sleepy dog godzilla godzilla godzilla
  1. By splitting string into vector and than going through vector and replacing each occurence - simple... don't have the code, but you get the idea...
Code_So1dier
  • 921
  • 1
  • 6
  • 15