2

Disclaimer: This problem is more of a theoretical, rather than a practical interest. I want to find out various different ways of doing this, with speed as icing on the new year cake.

The Problem

I want to be able to store a list of strings, and be able to quickly combine them into 1 if needed.
In short, I want to condense a structure (currently a std::vector<std::string>) that looks like

["Hello, ", "good ", "day ", " to", " you!"]

to

["Hello, good day to you!"]
  • Is there any idiomatic way to achieve this, ala python's [ ''.join(list_of_strings) ]?
  • What is the best way to achieve this in C++, in terms of time?

Possible Approaches

The first idea I had is to

  • loop over the vector,
  • append each element to the first,
  • simultaneously delete the element.

We will be concatenating with += and reserve(). I assume that max_size() will not be reached.

Approach 1 (The Greedy Approach)

So called because it ignores conventions and operates in-place.

#if APPROACH == 'G'

    // Greedy Approach
    void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
    {
        // Reserve the size for all characters, less than max_size()
        my_strings[0].reserve(total_characters_in_list);

        // There are strings left, ...
        for(auto itr = my_strings.begin()+1; itr != my_strings.end();)
        {
            // append, and...
            my_strings[0] += *itr;
            // delete, until...
            itr = my_strings.erase(itr);
        }
    }

#endif

Now I know, you would say that this is risky and bad. So:

  • loop over the vector,
  • append each element to another std::string,
  • clear the vector and make the string first element of the vector.

Approach 2 (The "Safe" Haven)

So called because it does not modify the container while iterating over it.

#if APPROACH == 'H'

    // Safe Haven Approach
    void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
    {
        // Store the whole vector here
        std::string condensed_string;
        condensed_string.reserve(total_characters_in_list);

        // There are strings left...
        for(auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
        {
            // append, until...
            condensed_string += *itr;
        }
        // remove all elements except the first
        my_strings.resize(1);
        // and set it to condensed_string
        my_strings[0] = condensed_string;
    }

#endif

Now for the standard algorithms...

Using std::accumulate from <algorithm>

Approach 3 (The Idiom?)

So called simply because it is a one-liner.

#if APPROACH == 'A'

    // Accumulate Approach
    void condense(std::vector< std::string >& my_strings, int total_characters_in_list)
    {
        // Reserve the size for all characters, less than max_size()
        my_strings[0].reserve(total_characters_in_list);

        // Accumulate all the strings
        my_strings[0] = std::accumulate(my_strings.begin(), my_strings.end(), std::string(""));
        // And resize
        my_strings.resize(1);
    }

#endif

Why not try to store it all in a stream?

Using std::stringstream from <sstream>.

Approach 4 (Stream of Strings)

So called due to the analogy of C++'s streams with flow of water.

#if APPROACH == 'S'

    // Stringstream Approach
    void condense(std::vector< std::string >& my_strings, int) // you can remove the int
    {
        // Create out stream
        std::stringstream buffer(my_strings[0]);

        // There are strings left, ...
        for(auto itr = my_strings.begin(); itr != my_strings.end(); ++itr)
        {
            // add until...
            buffer << *itr;
        }

        // resize and assign
        my_strings.resize(1);
        my_strings[0] = buffer.str();
    }

#endif

However, maybe we can use another container rather than std::vector?

In that case, what else?

(Possible) Approach 5 (The Great Indian "Rope" Trick)

I have heard about the rope data structure, but have no idea if (and how) it can be used here.


Benchmark and Verdict:

Ordered by their time efficiency (currently and surprisingly) is1:

Approaches          Vector Size: 40     Vector Size: 1600   Vector Size: 64000

SAFE_HAVEN:         0.1307962699997006  0.12057728999934625 0.14202970000042114 
STREAM_OF_STRINGS:  0.12656566000077873 0.12249500000034459 0.14765803999907803
ACCUMULATE_WEALTH:  0.11375975999981165 0.12984520999889354 3.748660090001067   
GREEDY_APPROACH:    0.12164988000004087 0.13558526000124402 22.6994204800023    

timed with2:

NUM_OF_ITERATIONS = 100
test_cases = [ 'greedy_approach', 'safe_haven' ]
for approach in test_cases:
    time_taken = timeit.timeit(
        f'system("{approach + ".exe"}")',
        'from os import system',
        number = NUM_OF_ITERATIONS
    )
    print(approach + ": ", time_taken / NUM_OF_ITERATIONS)

Can we do better?

Update: I tested it with 4 approaches (so far), as I could manage in my little time. More incoming soon. It would have been better to fold the code, so that more approaches could be added to this post, but it was declined.

1 Note that these readings are only for a rough estimate. There are a lot of things that influence the execution time, and note that there are some inconsistencies here as well.

2 This is the old code, used to test only the first two approaches. The current code is a good deal longer, and more integrated, so I am not sure I should add it here.

Conclusions:

  • Deleting elements is very costly.
  • You should just copy the strings somewhere, and resize the vector.
  • Infact, better reserve enough space too, if copying to another string.
Jaideep Shekhar
  • 808
  • 2
  • 7
  • 21
  • I would always use the std algorithm approach like described in the answer to [Combining a vector of strings](https://stackoverflow.com/questions/1985978/combining-a-vector-of-strings). And only if that is really a bottleneck I would try to optimize it. But I don't think that you can do much about performance there, so I would go with the most readable one. – t.niese Jan 01 '20 at 10:15
  • @t.niese Interesting. Let me benchmark it. Any other ideas would also be appreciated. – Jaideep Shekhar Jan 01 '20 at 10:17
  • You always need to take care when you do create benchmarks. They do not necessarily reflect how those approaches would perform in a real application. Benchmarking is most of the time only meaningful if you do it in your real application, and even then there are so many side effects, influencing the result, as those are not easy to interpret correctly. – t.niese Jan 01 '20 at 10:34
  • @t.niese The readings are coming quite close, and the fluctuation is too great. I am trying again with a bigger (more real life) load. Thanks. – Jaideep Shekhar Jan 01 '20 at 10:49
  • That data size is for sure a factor. But the context in which it is executed also plays an important role. – t.niese Jan 01 '20 at 10:53
  • @t.niese "But the context in which it is executed..." which may be? – Jaideep Shekhar Jan 01 '20 at 11:08
  • 1
    Environment variables, linked libraries, the path from which it is launched, ... all of them have an influence on the memory layout and the caching and therefore can have a great effect on the performance outcome. So not only this could influence those benchmarks, but also in the real use case. Here is a whole talk about that: ["Performance Matters" by Emery Berger](https://www.youtube.com/watch?v=r-TLSBdHe1A) – t.niese Jan 01 '20 at 11:13
  • 2
    Your use of [`std::vector::erase`](https://en.cppreference.com/w/cpp/container/vector/erase) invokes undefined behaviour. Assign its return value to `itr` to fix this. – Daniel Jour Jan 01 '20 at 13:17
  • @DanielJour noted. Thanks! – Jaideep Shekhar Jan 03 '20 at 05:22
  • This seems to be a duplicate of [this](https://stackoverflow.com/q/19664196). – JHBonarius Sep 09 '20 at 12:50

4 Answers4

2

You could also try std::accumulate:

auto s = std::accumulate(my_strings.begin(), my_strings.end(), std::string());

Won't be any faster, but at least it's more compact.

1

With range-v3 (and soon with C++20 ranges), you might do:

std::vector<std::string> v{"Hello, ", "good ", "day ", " to", " you!"};
std::string s = v | ranges::view::join;

Demo

Jarod42
  • 203,559
  • 14
  • 181
  • 302
0

By default, I would use std::stringstream. Simply construct the steam, stream in all the strings from the vector, and then return the output string. It isn't very efficient but it is clear what it does.

In most cases, one doesn't need fast method when dealing with strings and printing - so the "easy to understand and safe" methods are better. Plus, compilers nowadays are good at optimizing inefficiencies in simple cases.

The most efficient way... it is a hard question. Some applications require efficiency on multiple fronts. In these cases you might need to utilize multithreading.

ALX23z
  • 4,456
  • 1
  • 11
  • 18
0

Personally, I'd construct a second vector to hold a single "condensed" string, construct the condensed string, and then swap vectors when done.

 void Condense(std::vector<std::string> &strings)
 {
      std::vector<std::string> condensed(1);          // one default constructed std::string
      std::string &constr = &condensed.begin();      // reference to first element of condensed
      for (const auto &str : strings)
         constr.append(str);

      std::swap(strings, condensed);                //  swap newly constructed vector into original
 }

If an exception is thrown for some reason, then the original vector is left unchanged, and cleanup occurs - i.e. this function gives a strong exception guarantee.

Optionally, to reduce resizing of the "condensed" string, after initialising constr in the above, one could do

       //  optional:  compute the length of the condensed string and reserve
      std::size_t total_characters_in_list = 0;
      for (const auto &str : strings)
          total_characters_in_list += str.size();
      constr.reserve(total_characters_in_list);
       //   end optional reservation

As to how efficient this is compared with alternatives, that depends. I'm also not sure it's relevant - if strings keep on being appended to the vector, and needing to be appended, there is a fair chance that the code that obtains the strings from somewhere (and appends them to the vector) will have a greater impact on program performance than the act of condensing them.

Peter
  • 35,646
  • 4
  • 32
  • 74
  • "there is a fair chance that the code that obtains the strings from somewhere (and appends them to the vector) will have a greater impact on program performance than the act of condensing them..." and that is what I am going to test as well. But please expect no response until tomorrow. – Jaideep Shekhar Jan 01 '20 at 11:54