3

I'm working with long std::string(std::wstring) processing

The length of original string can be 10 to 1 million, and I got a number of substring offsets. what I need is to concatenate of multiple substrings of original string and with some new strings.

Except using using string append

auto str = new std::string();
str.append(original.substr(a,b)).append(newstr1).append(original.substr(c,d)).append.....

Is there any more efficient way like handling pointer or string iterators?

Thanks.

UPDATE:

I got several feedbacks now, except for rope I can test all the other methods The result is following:

#include <string>
#include <iostream>
#include <chrono>
#include <ctime>
std::string GetSystemTimeEpoch(){
    using namespace std::chrono;
    auto now = system_clock::now();
    time_point<system_clock> epoch;
    microseconds ms = duration_cast<milliseconds>(now - epoch);
    double epoch_time = (unsigned long long)ms.count() / 1000000.0;
    unsigned long long postfix = (unsigned long long)ms.count() % 1000000;
    std::time_t   time = static_cast<time_t>(epoch_time);
    std::tm tm = *std::localtime(&time);
    char Buf[80];
    std::strftime(Buf, sizeof(Buf), "%Y-%m-%dT%H:%M:%S", &tm);
    std::string finaltime(Buf);
    return finaltime.append(".").append(std::to_string(postfix));
}

#define TESTLENGTH1 1000000000
#define TESTLENGTH2 300000000
int main(){
    std::string Str(TESTLENGTH2, 'c');
    std::cout << GetSystemTimeEpoch() << " Begin of Method 1(replace)"<< std::endl;
    for (size_t i = 0; i < Str.length(); i++){
        Str.replace(i, 1, "d");
    }
    std::cout << GetSystemTimeEpoch() << " Begin of Method 2(append)" << std::endl;
    std::string NewStr1;
    for (size_t i = 0; i < Str.length(); i++){
        NewStr1.append(Str.substr(i, 1));
    }
    std::cout << GetSystemTimeEpoch() << " Begin of Method 3(+=)" << std::endl;
    std::string NewStr2;
    for (size_t i = 0; i < Str.length(); i++){
        NewStr2 += Str.substr(i, 1);
    }
    std::cout << GetSystemTimeEpoch() << " Begin of Method 4(reserve)" << std::endl;
    std::string NewStr3;
    NewStr3.reserve(TESTLENGTH2);
    for (size_t i = 0; i < Str.length(); i++){
        NewStr3 += Str.substr(i, 1);
    }
    std::cout << GetSystemTimeEpoch() << " End" << std::endl;
    return 0;
}

===

2016-05-21T22:38:51.471000 Begin of Method 1(replace)
2016-05-21T22:38:58.972000 Begin of Method 2(append)
2016-05-21T22:39:14.429000 Begin of Method 3(+=)
2016-05-21T22:39:29.944000 Begin of Method 4(reserve)
2016-05-21T22:39:44.892000 End
Press any key to continue . . .

It seems the quickest way is not doing concatenation but do replace instead for the concatenation.(method1)

Concatenation methods(2,3,4) there seems no difference.

I have not tested sgi ROPE class since I could not find a beginners document to start with :). If someone know about it, please leave a sketch or complete this testcase.

PS. TESTLENGTH1 crashed for method 2 and 3 and 4

PS2. Testing environment, Win7x64;VC++2013;Target Win32,Release. i5 2GHz,8GB RAM

Boying
  • 1,404
  • 13
  • 20

3 Answers3

2

The original STL from SGI had a data-structure called a rope. This stored an array of subsequences, so the construction of your new sequence would be O(1).

See this answer. You can download the SGI STL from here.

Community
  • 1
  • 1
2

Regarding the tests: You should use a profiling tool or write a proper test case measuring execution time:

#include <string>
#include <iostream>
#include <chrono>
#include <random>

using std::chrono::system_clock;

#define TESTLENGTH 100000000

std::string random_string() {
    std::random_device random_device;
    std::mt19937 random_generator(random_device());
    std::uniform_int_distribution<char>distribution;
    std::string result(TESTLENGTH, 0);
    for(auto& c :result)
        c = distribution(random_generator);
    return result;
}

void print_duration(const system_clock::time_point& start, const system_clock::time_point& stop) {
    using namespace std::chrono;
    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
}

void utilize(const std::string& str)
{
    static volatile char* result = new char[TESTLENGTH];;
    std::copy(str.begin(), str.begin() + std::max(str.size(), std::string::size_type(TESTLENGTH)), result);
}

int main(){
    for(unsigned loop = 0; loop < 4; ++loop) {
        std::cout << "Method 1(replace): "<< std::endl;
        {
            std::string  Str = random_string();
            auto start = system_clock::now();
            std::string NewStr(TESTLENGTH, 0);
            for (size_t i = 0; i < Str.length(); i++){
                NewStr.replace(i, 1, 1, Str[i]);
            }
            auto stop = system_clock::now();
            print_duration(start, stop);
            utilize(NewStr);
        }
        std::cout << "Method 2(append)" << std::endl;;
        {
            std::string  Str = random_string();
            auto start = system_clock::now();
            std::string NewStr;
            for (size_t i = 0; i < Str.length(); i++){
                NewStr.append(1, Str[i]);
            }
            auto stop = system_clock::now();
            print_duration(start, stop);
            utilize(NewStr);
        }
        std::cout << "Method 3(+=)" << std::endl;
        {
            std::string  Str = random_string();
            auto start = system_clock::now();
            std::string NewStr;
            for (size_t i = 0; i < Str.length(); i++){
                NewStr += Str[i];
            }
            auto stop = system_clock::now();
            print_duration(start, stop);
            utilize(NewStr);
        }
        std::cout << "Method 4(reserve)" << std::endl;
        {
            std::string  Str = random_string();
            auto start = system_clock::now();
            std::string NewStr;
            NewStr.reserve(TESTLENGTH);
            for (size_t i = 0; i < Str.length(); i++){
                NewStr += Str[i];
            }
            auto stop = system_clock::now();
            print_duration(start, stop);
            utilize(NewStr);
        }
    }
    return 0;
}

Notes:

The code is not reflecting the original question (only single and no character sequences are appended to the result string), but it is an improvement of the code shown in the question.

I made the replace approach comparable to the other.

To prevent needless overhead the time measurement is kept to a minimum (is not using the dreadful GetSystemTimeEpoch)

To avoid unwanted overhead, I dropped std::string::substr.

To prevent unwanted compiler optimizations:

  • The input is randomized
  • The result is utilized (by copying it to a volatile address)

To get a more reliable result the measurement is executed multiple times (maybe it should be more than 4).

Results:

Having g++ 4.8.4 with g++ -std=c++11 -O3 my measurement is:

Method 1(replace): 
1766
Method 2(append)
1292
Method 3(+=)
684
Method 4(reserve)
628
Method 1(replace): 
1766
Method 2(append)
1275
Method 3(+=)
678
Method 4(reserve)
572
Method 1(replace): 
1768
Method 2(append)
1276
Method 3(+=)
678
Method 4(reserve)
559
Method 1(replace): 
1767
Method 2(append)
1276
Method 3(+=)
682
Method 4(reserve)
579

Replacing append by push_back leads to the same performance as using operator +=.

  • Hi, seems I got different result: //std::uniform_int_distribution <> distribution(32,97); Method 1(replace): 1346 Method 2(append) 845 Method 3(+=) 1292 Method 4(reserve) 1263 – Boying May 22 '16 at 02:52
  • /GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" – Boying May 22 '16 at 13:47
  • the default max optimization(/O2) seems no difference with full optimization(/Ox). When I got chance later, I'll run you code in linux and mac to check the difference. – Boying May 22 '16 at 13:48
1

The First method is to use Martin Bonner's method. I am not really sure, but it is worth a try.


The Second method is to use the operator +(or +=).
It will make your code shorter (and maybe even faster).


and also, you said the length could be 10 to 1 million. then here is a good news! according to string::max_size, the max length of a string is almost 429 million, so that is not a thing you should worry about.

bolov
  • 72,283
  • 15
  • 145
  • 224
  • 1
    += shorts code, but seems the same speed:) ,see update – Boying May 21 '16 at 15:09
  • 1
    @Matthew Roh I think you meant 4.29 billions, i.e. 2^32-1, which is the size of a 32-bit address space. On 64-bit systems, which have more than 4GB of address space, max_size is far, far higher, to the point it's meaningless because it doesn't take the available physical RAM into account. For example, this is what I get on a linux system : 9 223 372 036 854 775 807 (i.e. 2^63-1). Of course I don't have enough RAM + swap to hold that much data. – Johan Boulé May 21 '16 at 20:04