
Consider this piece of code:

public String joinWords(String[] words) {
    String sentence = "";
    for(String w : words) {
        sentence = sentence + w;
    }
    return sentence;
}

On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2). Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).

In C++, meanwhile, std::string::append() has O(n) complexity, and I'm not clear about the complexity of stringstream.

In C++, are there methods like those in StringBuffer with the same complexity?

OmG
ethanjyx
    See this thread? http://stackoverflow.com/questions/2462951/c-equivalent-of-stringbuffer-stringbuilder – taocp Mar 14 '13 at 03:14
  • Or this one: http://stackoverflow.com/questions/7156122 – phs Mar 14 '13 at 03:17
  • Also, I'm not aware of any time complexity guarantees on `StringBuilder`/`StringBuffer`. – phs Mar 14 '13 at 03:17
  • Well.. I saw that thread, but I don't quite agree with the complexity part. – ethanjyx Mar 14 '13 at 03:19
  • @phs The thing is that string.append() has complexity O(n). In that thread it says that you can allocate memory in advance, but I do not see that as efficient. – ethanjyx Mar 14 '13 at 03:24
  • @ethanjyx: If you know in advance how much memory you need, allocating it upfront is more efficient than resizing the array each time. In almost all cases, "resizing" an array doesn't just add memory to it, cause that's not usually possible. Resizing typically involves creating a new array, copying all the old entries into it, and deleting the old one. That's just busywork if you already know how much space you're going to need, which is easy to find since strings know their length. – cHao Mar 14 '13 at 03:34
  • Whenever you talk about O(n), you have to precisely define n first. – Roland Illig Mar 28 '17 at 03:48
  • @RolandIllig I don't agree. Usually people should be able to infer from the context what n is. – ethanjyx Mar 30 '17 at 23:21
  • So what is it in this case: the number of strings, or the total length of the strings? Will everyone choose the correct one here? – Roland Illig Mar 31 '17 at 20:18
  • (No they won't because there is a mistake in the question. Which is why the OP >needed< to be explicit in this case.) – Stephen C Aug 25 '22 at 02:17

3 Answers


C++ strings are mutable, and pretty much as dynamically sizable as a StringBuffer. Unlike its equivalent in Java, this code wouldn't create a new string each time; it just appends to the current one.

std::string joinWords(std::vector<std::string> const &words) {
    std::string result;
    for (auto &word : words) {
        result += word;
    }
    return result;
}

This runs in linear time if you reserve the size you'll need beforehand. The question is whether looping over the vector to get the sizes would be slower than letting the string auto-resize. That, I couldn't tell you. Time it. :)
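For illustration, the reserve-ahead variant could look like this (a sketch; `joinWordsReserved` is just an illustrative name, not part of the answer above):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reserve-ahead variant: one pass to sum the lengths, one allocation,
// one pass to append. Total work is linear in the combined length of
// the words, with no reallocations during the appends.
std::string joinWordsReserved(const std::vector<std::string>& words) {
    std::size_t total = 0;
    for (const auto& w : words)
        total += w.size();

    std::string result;
    result.reserve(total);  // single up-front allocation
    for (const auto& w : words)
        result += w;
    return result;
}
```

Whether the extra sizing pass pays off depends on the data; as the answer says, time it.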

If you don't want to use std::string itself for some reason (and you should consider it; it's a perfectly respectable class), C++ also has string streams.

#include <sstream>
...

std::string joinWords(std::vector<std::string> const &words) {
    std::ostringstream oss;
    for (auto &word : words) {
        oss << word;
    }
    return oss.str();
}

It's probably not any more efficient than using std::string, but it's a bit more flexible in other cases -- you can stringify just about any primitive type with it, as well as any type that provides an operator<<(ostream&, const its_type&) overload.
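For example, a user-defined type can be streamed the same way (Point and describe are made-up names for this sketch):

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Made-up example type: anything with an operator<< overload can be fed
// to an ostringstream, not just primitives.
struct Point {
    int x, y;
};

std::ostream& operator<<(std::ostream& os, const Point& p) {
    return os << '(' << p.x << ',' << p.y << ')';
}

// Mix string literals, a user-defined type, and an int in one stream.
std::string describe(const Point& p) {
    std::ostringstream oss;
    oss << "point " << p << " sum " << (p.x + p.y);
    return oss.str();
}
```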

cHao
  • Note that if your `std::string` uses an exponential overshoot reserving strategy, even without the reserve this is linear. If it grows by a factor of 1.5 when it is too small, each char gets copied an average of 1/(1-1/1.5), or 3 times: and a constant factor of 3 or 4 means that we are still O(n). I am not aware if the standard mandates this strategy. – Yakk - Adam Nevraumont Mar 14 '13 at 04:21
  • Exponential resize seems common-sense enough that you may often see linear-time performance even with auto-resize...but I don't remember seeing any such requirement in the standard. (Aside from performance considerations, though, reserving the right size would be less likely to fragment the heap.) – cHao Jul 13 '13 at 16:56
  • "Unspecified, but generally up to linear in the new string length." mentioned in http://www.cplusplus.com/reference/string/string/operator+=/. Doesn't this mean it's O(N^2)? – sww Jan 20 '21 at 04:55
  • @sww: It means the loop can be, in certain pathological scenarios. Those scenarios occur when you always append enough characters at a time so the backing array has to constantly resize (read: copy to a larger array). Hence the mention of preemptively `reserve`ing the size you'll need beforehand. But that's only an issue if the backing array doesn't exponentially resize (say, doubling or more each time it has to resize), which apparently is rare enough to not be too serious a cause for concern. Exponential resize reduces the loop to amortized linear time. – cHao Jan 20 '21 at 12:11

This is somewhat tangential to your Question, but relevant nonetheless. (And too big for a comment!!)

On each concatenation a new copy of the string is created, so that the overall complexity is O(n^2).

In Java, the complexity of s1.concat(s2) or s1 + s2 is O(M1 + M2) where M1 and M2 are the respective String lengths. Turning that into the complexity of a sequence of concatenations is difficult in general. However, if you assume N concatenations of Strings of length M, then the complexity is indeed O(M * N * N) which matches what you said in the Question.

Fortunately in Java we could solve this with a StringBuffer, which has O(1) complexity for each append, then the overall complexity would be O(n).

In the StringBuilder case, the amortized complexity of N calls to sb.append(s) for strings of size M is O(M*N). The key word here is amortized. When you append characters to a StringBuilder, the implementation may need to expand its internal array. But the expansion strategy is to double the array's size. And if you do the math, you will see that each character in the buffer is going to be copied on average one extra time during the entire sequence of append calls. So the complexity of the entire sequence of appends still works out as O(M*N) ... and, as it happens M*N is the final string length.

So your end result is correct, but your statement about the complexity of a single call to append is not correct. (I understand what you mean, but the way you say it is facially incorrect.)

Finally, I'd note that in Java you should use StringBuilder rather than StringBuffer unless you need the buffer to be thread-safe.
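The same amortized effect can be observed with C++'s std::string, whose growth factor is implementation-defined; this sketch (countCapacityChanges is an illustrative helper) just counts how many times the backing buffer reallocates while appending n characters one at a time:

```cpp
#include <cstddef>
#include <string>

// Count how often std::string's backing buffer reallocates while
// appending n characters. With an exponential growth strategy this
// is roughly log(n) reallocations rather than n, which is what makes
// each append amortized O(1).
int countCapacityChanges(std::size_t n) {
    std::string s;
    int changes = 0;
    std::size_t cap = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s.push_back('x');
        if (s.capacity() != cap) {  // a reallocation happened
            cap = s.capacity();
            ++changes;
        }
    }
    return changes;
}
```

On common implementations, appending 100,000 characters triggers only a few dozen reallocations at most.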

Stephen C

As an example of a really simple structure that has O(n) complexity in C++11:

template<typename TChar>
struct StringAppender {
  std::vector<std::basic_string<TChar>> buff;
  StringAppender& operator+=( std::basic_string<TChar> v ) {
    buff.push_back(std::move(v));
    return *this;
  }
  explicit operator std::basic_string<TChar>() {
    std::basic_string<TChar> retval;
    std::size_t total = 0;
    for( auto&& s:buff )
      total+=s.size();
    retval.reserve(total+1);
    for( auto&& s:buff )
      retval += std::move(s);
    return retval;
  }
};

use:

StringAppender<char> append;
append += s1;
append += s2;
std::string s3( append ); // the conversion operator is explicit, so direct-initialization is required

This takes O(n), where n is the number of characters.

Finally, if you know how long all of the strings are, just doing a reserve with enough room makes append or += take a total of O(n) time. But I agree that is awkward.

Use of std::move with the above StringAppender (i.e., sa += std::move(s1)) will significantly increase performance for non-short strings (as will feeding it xvalues in general).

I do not know the complexity of std::ostringstream, but ostringstream is for pretty-printing formatted output, or cases where high performance is not important. I mean, they aren't bad, and they may even outperform scripted/interpreted/bytecode languages, but if you are in a rush, you need something else.

As usual, you need to profile, because constant factors are important.

An rvalue-reference-to-this operator+ might also be a good one, but few compilers implement rvalue references to *this.
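A sketch of what such a ref-qualified overload could look like (Builder is a made-up name; this requires a compiler with C++11 ref-qualifier support):

```cpp
#include <string>
#include <utility>
#include <vector>

// Made-up builder type demonstrating ref-qualified operator+ overloads.
struct Builder {
    std::vector<std::string> parts;

    // lvalue overload: the builder may be reused, so copy its state.
    Builder operator+(std::string v) const & {
        Builder b{parts};
        b.parts.push_back(std::move(v));
        return b;
    }

    // rvalue overload: the builder is a temporary, so steal its state.
    Builder operator+(std::string v) && {
        parts.push_back(std::move(v));
        return std::move(*this);
    }

    std::string str() const {
        std::string out;
        for (const auto& p : parts)
            out += p;
        return out;
    }
};
```

In a chain like `Builder{} + "a" + "b" + "c"`, each intermediate result is a temporary, so the && overload moves the accumulated vector along the chain instead of copying it at every step.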

Yakk - Adam Nevraumont
  • Where does the `log n` come from? Also, why do you believe this is better than just appending to a string? You're copying every string twice (once into the parameter variable of `operator+=` and once into the final string). The standard exponential reallocation algorithm used by (most) std::string implementations copies each character an average of twice (give or take a bit). – rici Mar 14 '13 at 03:46
  • You are right, no lg k or n factor. I copy each character twice, not each string. A 1.5 exponential realloc copies each character an average of 3 times if my napkin math is right. I move each string 4 times on average (once in, 1/(1-1/1.5) during reallocs), but that involves 0 character copies (ignoring short string optimization). In short, cache coherency could easily make += beat me (its copies are cache friendly): but if you feed me xvalue strings, I think I can beat += hands down (1 copy per char). – Yakk - Adam Nevraumont Mar 14 '13 at 04:04
  • 2
    I just checked libcxx (clang) and libstdc++ (gcc), and both of them use a doubling exponential algorithm. So each character is copied twice (roughly: a bit more because the final string is not a power of 2 in size, but a bit less because the intermediate strings are doubled before the string which pushes them over the edge is copied in.) – rici Mar 14 '13 at 04:22