2

First question, does std::string::erase reallocate?

Second question, are there any faster method to quickly erase certain words or phrase from a std::string? The length of the string is usually around 300K.

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
CLearner
  • 532
  • 1
  • 6
  • 17
  • 3
    `std::erase` doesnt exist – Karthik T Feb 20 '13 at 07:39
  • 1
    I suppose you mean `std::string::erase`, as there is no `std::erase`. – pmr Feb 20 '13 at 07:40
  • Whoops, I mean std:string::erase. – CLearner Feb 20 '13 at 07:42
  • 1
    If you have a list of words you want to remove from a string, you might want to try the [regular expression library](http://en.cppreference.com/w/cpp/regex) instead of smashing around using repeated `erase` calls, each of which will be moving considerable amounts of data around. Deleting a word from the beginning of the string can be expensive, and this cost compounds for each additional word removed. – tadman Feb 20 '13 at 07:49
  • @tadman I will consider your suggestion. – CLearner Feb 20 '13 at 07:52
  • [boost::algorithm::erase_all](http://www.boost.org/doc/libs/release/doc/html/boost/algorithm/erase_all.html) is a reasonable option for large strings. – Mankarse Feb 20 '13 at 07:52
  • If there are multiple words to be erased, then `erase_all` will have to be run multiple times. This could be very slow for large lists. – tadman Feb 20 '13 at 07:58

7 Answers7

5

It is not defined if string::erase is going to trigger a reallocation. You can check by comparing string::capacity to after and before calling the method to see what happens. Removing parts of a string is always going to trigger a copy of all characters that come after the erased parts, since the storage of a string is required to be continuous.

For operations on large strings you might want to consider using a rope or a std::list instead. This might turn out faster depending on what you do.

pmr
  • 58,701
  • 10
  • 113
  • 156
4

21.4.1/3

No erase() or pop_back() member function shall throw any exceptions.

Since no such restriction exists on the allocator, I think that it is safe to say that no, std::string::erase does not, and can not, reallocate.

Benjamin Lindley
  • 101,917
  • 9
  • 204
  • 274
  • 1
    Since it should either do nothing or shrink the string, it doesn't seem like an allocation would ever be required. – tadman Feb 20 '13 at 07:59
  • @tadman You'd think that was the case for shrink_to_fit as well, but that one *can* reallocate depending on implementation. ¯\\_(ツ)_/¯ – Johann Studanski Mar 06 '20 at 12:41
  • @JohannStudanski If it failed to allocate more memory it could always just go back to the allocation it already has. – tadman Mar 06 '20 at 18:20
2

You might want to have a look at rope. It is a heavy-duty string (get it?) designed for large strings, with much faster substring operations. Unfortunately, it isn't part of the std, but rather a common addition (in SGI, STLPort and GNU's libstdc++).

See STL Rope - when and where to use

Community
  • 1
  • 1
Alex Chamberlain
  • 4,147
  • 2
  • 22
  • 49
2

It's already been mentioned that it's implementation dependent whether std::string::erase triggers a reallocation. so I wanted to focus on the string searching. The traditional approach to this problem would be to use the Aho-Corasick algorithm.

Alternatively, David Musser wrote a paper on searching for needles (substrings) in large haystacks (strings) using a hybrid of the Boyer-Moore and Knuth-Morris-Pratt algorithms. The paper is available here. Adapting this would probably be simpler than rolling an Aho-Corasick implementation.

Musser's approach exhibits must faster behavior than the naive search and replace. It should be possible to adapt the algorithm for your purposes by modifying the BM skip loop and KNP lookup table to account for all of the needles that you are looking to replace. Allocate an output buffer in advance and iteratively construct the output string by appending to it all non-matching segments of the haystack. This approach will get less effective as the number of needles grows and the BM/KNP lookups saturate.

Chris Hayden
  • 1,104
  • 6
  • 6
1

From my realisation of STL I can see that condition of reallocating string during std::string::erase is: if (__new_size > this->capacity() || _M_rep()->_M_is_shared()) I think it means that string isn't reallocating during erase call.

1
  1. No, std::string::erase does not reallocate - because it does not need to and because it's C++ philosophy that you don't pay (reallocation time) for what you don't need.
  2. Depends on what you want to erase and what you mean with quickly (quickly to type or to perform).

First thing to do is of course find a fast algorithm to find the words/phrases that you want to remove. Then, if there's only one chunk to erase, std::string::erase should be perfectly suited for your needs. However, if for example you have the string "000aa11111bbbbb2222222c3333333333" and want to erase all phrases containing letters, just finding and erasing them one after another will lead to multiple copies of the remainder of the string - the '1's will get copied once, '2's will get copied twice and so on. So if there are many phrases to erase in the string, there will be a possibility to improve performance - just copy the chunks that should remain in the string individually and overwrite the chunks you want to erase: (| denotes an iterator until which the string is "correct"):

  • "000|aa11111bbbbb2222222c3333333333"
  • "00011111|11bbbbb2222222c3333333333"
  • "000111112222222|2222222c3333333333"
  • "0001111122222223333333333|33333333"
  • "0001111122222223333333333"

That way, you have to copy every character after the first erased phrase exactly once.

Arne Mertz
  • 24,171
  • 3
  • 51
  • 90
0

I'm using VC6 from MS and this last DO reallocate buffer on std::string::erase() call. I had to remove all erase() calls from my code as I'm sometimes using big strings and I found some big slow down due to this. So care about your compiler and avoid erase(). Personally, I use reaffectations str = ""; as a workaround.

xbrain
  • 1
  • 3