6

Possible Duplicate:
Simple string concatenation

Yesterday, as I'm writing this, someone asked on SO

if i have a string x='wow' applying the function add in python :

x='wow'
x.add(x)
'wowwow'

how can i do that in C++?

With add (which is non-existent) corrected to __add__ (a standard method) this is a deep and interesting question, involving both subtle low level details, high level algorithm complexity considerations, and even threading!, and yet it’s formulated in a very short and concise way.

I am reposting the original question as my own because I did not get a chance to provide a correct answer before it was deleted, and my efforts at having the original question revived, so that I could help to increase the general understanding of these issues, failed.

I have changed the original title “select python or C++” to …

  • What is the equivalent of CPython string concatenation, in C++?

thereby narrowing the question a little.

Community
  • 1
  • 1
Cheers and hth. - Alf
  • 142,714
  • 15
  • 209
  • 331
  • IMO, that original question should be (and in 10 minutes probably will be) undeleted. – Blender Oct 23 '12 at 00:33
  • When the original question was asked, it was along the tone of "How do I concatenate strings in C++?", with a Python example to help explain. In C++, strings are mutable. The way it was asked, I don't think mutability should matter. – chris Oct 23 '12 at 00:35
  • @Blender This, while interesting, is in no way a justification to undelete the original question. There was no mention of complexity-related, let alone multithreading-related, issues in the original question. Had the original poster been interested in the finer details, he or she would have made _basic preparations_ (such as looking up `std::string::operator+` and asking *how is that different from Python's `__add__`?*) Undeleting that question sends the wrong message. It invites people to send more of those poorly-researched, low-quality, duplicate, trivial questions. – jogojapan Oct 23 '12 at 02:17
  • @chris: you're wrong. the original question is quoted in this one. since you could hardly fail to notice that, i believe you're not entirely honest about it. – Cheers and hth. - Alf Nov 01 '12 at 20:41
  • @jogojapan: re "There was no mention of complexity-related, let alone multithreading-related, issues in the original question", if the OP had known what was relevant, then the OP would not have had to *ask*, now would he? also, answers on SO should not be limited to what one telepathically infers that the asker was thinking of. they should instead address the question asked (e.g. to help other people!). as with chris' comment there is therefore a distinct **lack of logic** in your comment. and as i read it, what with the emphasis placed, there is also an abundance of misdirection. – Cheers and hth. - Alf Nov 01 '12 at 20:45

1 Answers1

10

The general meaning of the code snippet.

The given code snippet

x = 'wow'
x.__add__( x )

has different meanings in Python 2.x and Python 3.x.

In Python 2.x the strings are by default narrow strings, with one byte per encoding unit, corresponding to C++ char based strings.

In Python 3.x the strings are wide strings, guaranteed to represent Unicode, corresponding to the in practice usage of C++ wchar_t based strings, and likewise with an unspecified 2 or 4 bytes per encoding unit.

Disregarding efficiency the __add__ method behaves the same in both main Python versions, corresponding to the C++ + operator for std::basic_string (i.e., for std::string and std::wstring), e.g. quoting the CPython 3k documentation:

object.__add__(self, other)
… to evaluate the expression x + y, where x is an instance of a class that has an __add__() method, x.__add__(y) is called.

So as an example, the CPython 2.7 code

x = 'wow'
y = x.__add__( x )
print y

would normally be written as

x = 'wow'
y = x + x
print y

and corresponds to this C++ code:

#include <iostream>
#include <string>
using namespace std;

int main()
{
    auto const x = string( "wow" );
    auto const y = x + x;
    cout << y << endl;
}

A main difference from the many incorrect answers given for the original question, is that the C++ correspondence is an expression, and not an update.

It is perhaps natural to think that the method name __add__ signifies a change of the string object’s value, an update, but with respect to observable behavior Python strings are immutable strings. Their values never change, as far as can be directly observed in Python code. This is the same as in Java and C#, but very unlike C++’s mutable std::basic_string strings.

The quadratic to linear time optimization in CPython.

CPython 2.4 added the following optimization, for narrow strings only:

String concatenations in statements of the form s = s + "abc" and s += "abc" are now performed more efficiently in certain circumstances. This optimization won't be present in other Python implementations such as Jython, so you shouldn't rely on it; using the join() method of strings is still recommended when you want to efficiently glue a large number of strings together. (Contributed by Armin Rigo.)

It may not sound like much, but where it’s applicable this optimization reduces a sequence of concatenations from quadratic time O(n2) to linear time O(n), in the length n of the final result.

First of all the optimization replaces concatenations with updates, e.g. as if

x = x + a
x = x + b
x = x + c

or for that matter

x = x + a + b + c

was replaced with

x += a
x += b
x += c

In the general case there will be many references to the string object that x refers to, and since Python string objects must appear to be immutable, the first update assignment cannot change that string object. It therefore, generally, has to create a completely new string object, and assign that (reference) to x.

At this point x holds the only reference to that object. Which means that the object can be updated by the update assignment that appends b, because there are no observers. And likewise for the append of c.

It’s a bit like quantum mechanics: you cannot observe this dirty thing going on, and it’s never done when there is the possibility of anyone observing the machinations, but you can infer that it must have been going on by the statistics that you collect about performance, for linear time is quite different from quadratic time!

How is linear time achieved? Well, with the update the same strategy of buffer doubling as in C++ std::basic_string can be done, which means that the existing buffer contents only need to be copied at each buffer reallocation, and not for each append operation. Which means that the total cost of copying is at worst linear in the final string size, in the same way as the sum (representing the costs of copying at each buffer doubling) 1 + 2 + 4 + 8 + … + N is less than 2*N.

Linear time string concatenation expressions in C++.

In order to faithfully reproduce the CPython code snippet in C++,

  • the final result and expression-nature of the operation be should captured,

  • and also its performance characteristics should be captured!

A direct translation of CPython __add__ to C++ std::basic_string + fails to reliably capture the CPython linear time. The C++ + string concatenation may be optimized by the compiler in the same way as the CPython optimization. Or not – which then means that one has told a beginner that the C++ equivalent of a Python linear time operation, is something with quadratic time – hey, this is what you should use…

For the performance characteristics C++ += is the basic answer, but, this does not catch the expression nature of the Python code.

The natural answer is a linear time C++ string builder class that translates a concatenation expression to a series of += updates, so that the Python code

from __future__ import print_function

def foo( s ):
    print( s )

a = 'alpha'
b = 'beta'
c = 'charlie'
foo( a + b + c )    # Expr-like linear time string building.

correspond to roughly

#include <string>
#include <sstream>

namespace my {
    using std::string;
    using std::ostringstream;

    template< class Type >
    string stringFrom( Type const& v )
    {
        ostringstream stream;
        stream << v;
        return stream.str();
    }

    class StringBuilder
    {
    private:
        string      s_;

        template< class Type >
        static string fastStringFrom( Type const& v )
        {
            return stringFrom( v );
        }

        static string const& fastStringFrom( string const& s )
        { return s; }

        static char const* fastStringFrom( char const* const s )
        { return s; }

    public:
        template< class Type >
        StringBuilder& operator<<( Type const& v )
        {
            s_ += fastStringFrom( v );
            return *this;
        }

        string const& str() const { return s_; }
        char const* cStr() const { return s_.c_str(); }

        operator string const& () const { return str(); }
        operator char const* () const { return cStr(); }
    };
}  // namespace my

#include <iostream>
using namespace std;
typedef my::StringBuilder S;

void foo( string const& s )
{
    cout << s << endl;
}

int main()
{
    string const    a   = "alpha";
    string const    b   = "beta";
    string const    c   = "charlie";

    foo( S() << a << b << c );      // Expr-like linear time string building.
}
Community
  • 1
  • 1
Cheers and hth. - Alf
  • 142,714
  • 15
  • 209
  • 331
  • 1
    I believe it would be a bit more efficient to maintain a `std::stringstream` in the builder (that is what I have in my implementation). While the asymptotic complexity would be the same, the number of memory allocations will be lower (when appending anything other than strings). For example, for an `int` you need a dynamic allocation in the `std::stringstream` inside `stringFrom`, plus an additional allocation for the returned `std::string` (this one can be elided) before the copy to the destination string. – David Rodríguez - dribeas Oct 23 '12 at 02:00
  • @DavidRodríguez-dribeas: on the contrary, for efficiency you should use the string, plus non-stream based conversions via `fastStringFrom` specializations, or the general `stringFrom`. I don't know exactly why. But the streams generally do not perform very well (I only use them for clarity above). On an unrelated note, there are several other approaches to linear time string concatenation expressions in C++. I didn't feel I had the time or familiarity (to be honest) to write about those. I just put down what I generally use. It's like the simplest approach. – Cheers and hth. - Alf Oct 23 '12 at 02:27
  • I just run a quick comparison by concatenating 1M numbers (the range [0..999999]). Compiled with `g++ -O3` there is a factor of 10 difference (average over .93s for your implementation, .087 by storing a `std::ostringstream` directly). You are right in that you should avoid going through the `stringstream` if you are only working with strings, which I would have understood (i.e. plain `std::string` concat) but going generic (any input type) you must decide one way or the other depending on the expected inputs. – David Rodríguez - dribeas Oct 23 '12 at 03:03
  • ... A different issue in the proposed implementation is the double conversion to both `const std::string&` and `const char*`, which will cause ambiguities in many contexts `void f(std::string); f( S() << "a" );` and potential issues (I considered offering the `const char*` conversion, but decided against it to avoid potential lifetime issues. (BTW, the same test as before with only `std::string` of small size [actually just "Hi"'s] objects falls to your side by a factor of 1:2) – David Rodríguez - dribeas Oct 23 '12 at 03:10
  • @David: if you want to discuss efficiency then I suggest a new SO question. the above is code to teach the principle. in particular you don't want to use the streams when you aim for efficiency, but you do want to use them when you aim for exposition clarity. – Cheers and hth. - Alf Oct 23 '12 at 10:30
  • @David: I fail to reproduce the ambiguity you mention with Visual C++11 and g++ 4.7.2. there may still be some ambiguity of the kind you envision, i dunno. anyway, please ask a separate SO question about that, yes? – Cheers and hth. - Alf Oct 23 '12 at 10:34