First question, does std::string::erase
reallocate?
Second question, are there any faster method to quickly erase certain words or phrase from a std::string
? The length of the string is usually around 300K.
First question, does std::string::erase
reallocate?
Second question, are there any faster method to quickly erase certain words or phrase from a std::string
? The length of the string is usually around 300K.
It is not defined if string::erase
is going to trigger a reallocation. You can check by comparing string::capacity
to after and before calling the method to see what happens. Removing parts of a string is always going to trigger a copy of all characters that come after the erased parts, since the storage of a string is required to be continuous.
For operations on large strings you might want to consider using a rope or a std::list instead. This might turn out faster depending on what you do.
21.4.1/3
No erase() or pop_back() member function shall throw any exceptions.
Since no such restriction exists on the allocator, I think that it is safe to say that no, std::string::erase
does not, and can not, reallocate.
You might want to have a look at rope
. It is a heavy-duty string (get it?) designed for large strings, with much faster substring operations. Unfortunately, it isn't part of the std
, but rather a common addition (in SGI, STLPort and GNU's libstdc++).
It's already been mentioned that it's implementation dependent whether std::string::erase triggers a reallocation. so I wanted to focus on the string searching. The traditional approach to this problem would be to use the Aho-Corasick algorithm.
Alternatively, David Musser wrote a paper on searching for needles (substrings) in large haystacks (strings) using a hybrid of the Boyer-Moore and Knuth-Morris-Pratt algorithms. The paper is available here. Adapting this would probably be simpler than rolling an Aho-Corasick implementation.
Musser's approach exhibits must faster behavior than the naive search and replace. It should be possible to adapt the algorithm for your purposes by modifying the BM skip loop and KNP lookup table to account for all of the needles that you are looking to replace. Allocate an output buffer in advance and iteratively construct the output string by appending to it all non-matching segments of the haystack. This approach will get less effective as the number of needles grows and the BM/KNP lookups saturate.
From my realisation of STL I can see that condition of reallocating string during std::string::erase
is:
if (__new_size > this->capacity() || _M_rep()->_M_is_shared())
I think it means that string isn't reallocating during erase
call.
std::string::erase
does not reallocate - because it does not need to and because it's C++ philosophy that you don't pay (reallocation time) for what you don't need.First thing to do is of course find a fast algorithm to find the words/phrases that you want to remove. Then, if there's only one chunk to erase, std::string::erase
should be perfectly suited for your needs.
However, if for example you have the string "000aa11111bbbbb2222222c3333333333" and want to erase all phrases containing letters, just finding and erasing them one after another will lead to multiple copies of the remainder of the string - the '1's will get copied once, '2's will get copied twice and so on. So if there are many phrases to erase in the string, there will be a possibility to improve performance - just copy the chunks that should remain in the string individually and overwrite the chunks you want to erase:
(| denotes an iterator until which the string is "correct"):
That way, you have to copy every character after the first erased phrase exactly once.
I'm using VC6 from MS and this last DO reallocate buffer on std::string::erase() call. I had to remove all erase() calls from my code as I'm sometimes using big strings and I found some big slow down due to this. So care about your compiler and avoid erase(). Personally, I use reaffectations str = ""; as a workaround.