
For context, I wrote this algorithm to get the number of unique substrings of any string. It builds the suffix tree for the string, counting the nodes it contains, and returns that count as the answer. The problem I wanted to solve required an O(n) algorithm, so this question is only about how this code behaves and not about how bad it is at what it does.

struct node{
    char value = ' ';
    vector<node*> children;
    ~node()
    {
        for (node* child: children)
        {
            delete child;
        }
    }
};

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        string tmp = aString.substr(i, aString.size());
        node* currentNode = root;
        char indexToNext = 0;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == tmp[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < tmp.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = tmp[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

I decided to benchmark this algorithm. I simply looped over a big string, taking a bigger substring each iteration, calling numberOfUniqueSubstrings and measuring how long it took to finish.
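The driver was essentially a loop like the following (a simplified sketch of what I just described, not the exact code I used):

#include <chrono>
#include <iostream>
#include <string>

// Simplified benchmark sketch: for each prefix length, build the tree once and
// print how long the call took in microseconds. Whether the tree teardown
// (the delete) sits inside the timed region is a detail this sketch glosses over.
void benchmark(const std::string& bigString)
{
    for (std::size_t len = 1; len < bigString.size(); ++len)
    {
        std::string prefix = bigString.substr(0, len);

        node* root = nullptr;
        auto start = std::chrono::high_resolution_clock::now();
        numberOfUniqueSubstrings(prefix, root);
        auto stop = std::chrono::high_resolution_clock::now();
        delete root;

        std::cout << len << " "
                  << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
                  << "\n";
    }
}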

I plotted the results in Octave and this is what I got (x is string size and y is time in microseconds):

Performance plot, x is string length and y is time in microseconds

I first thought the problem lay in the input string, but it's just an alphanumeric string I got from a book (any other text behaves just as strangely).

I also tried averaging many calls to the function with the same parameter, and the result is pretty much the same.

This is compiled with g++ problem.cpp -std=c++14 -O3, but it seems to behave the same with -O2 and -O0.

Edit: After @interjay's answer I tried doing just that, which leaves the function as:

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        char indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

And it does indeed make it a bit faster, but it's no less strange, for I plotted this:

plot without tmp string

Something is happening at x = 1000 and I have no clue what it could be.

Another plot for good measure:

Another performance plot

I've now run gprof for a string of size 999:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)
            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------

And for a string of size 1001:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)


            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------


Index by function name

  [11] _GLOBAL__sub_I__Z7imprimePK4node [1] node::~node()
  [12] numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [10] void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)

However it seems that running the profiler removes the effect and the times are pretty much the same in both cases.

mjgalindo
  • Maybe wrong branch prediction? – Guillaume Racicot Jan 22 '17 at 19:23
  • Also try to do currentNode->children.reserve(), since you already know the size of this vector after the for loop. – vasi1y Jan 22 '17 at 19:31
  • You also use `vector`. It's based on an array structure, if I'm not mistaken, so it takes time to grow the array's capacity. That will influence the performance too. – Denis Sologub Jan 22 '17 at 19:31
  • @Шах std::vector takes amortized constant time for its `push_back` operation, so it won't affect the time complexity here. – interjay Jan 22 '17 at 19:37
  • Thank you for profiling and publishing the results. – Thomas Matthews Jan 22 '17 at 19:37
  • Your chart has a big dispersion at the start and a very small one after 1000. It looks like one of the standard classes (`vector` or `string`) uses a different algorithm once it contains more than 1000 elements. – Denis Sologub Jan 22 '17 at 19:38
  • @interjay, the amortized constant time is a statistical value, not a deterministic one (I mean, it holds over the whole execution, not for each individual call), so at some moments it can have different execution times, which will give waves on the chart. – Denis Sologub Jan 22 '17 at 19:49
  • @Шах You're right that it might be the cause of the waves (although I don't see a reason for a performance spike to disappear when you keep increasing n). I thought you were talking about the asymptotic performance in your previous comment, which is still guaranteed to be O(n) for n operations. – interjay Jan 22 '17 at 19:54
  • What happens if you go over `2000` characters? I suspect it might have something to do with `vector` resizing itself. Let's see what happens at `x = 2000` and `x = 4000`. – IVlad Jan 22 '17 at 20:17
  • @IVlad edited one in there. It looks very `n log(n)`ic right? – mjgalindo Jan 22 '17 at 20:20
  • It does. I was wrong about it being about the vector resize. I still suspect what Шах mentioned -- different algorithm for small `n` for some std structure -- but I don't know what it could be. – IVlad Jan 22 '17 at 20:23
  • you need to add the c++11 flag, since your code uses c++11 extensions – user31264 Jan 22 '17 at 20:25
  • Is this really 1000, not 1024? – user31264 Jan 22 '17 at 20:36
  • What's your compiler and processor? – Richard Jan 22 '17 at 21:07
  • What operating system and standard library are you on? Is the heap changing behavior somehow after 1k allocations or something? – user541686 Jan 22 '17 at 21:10
  • Just a remark: since new nodes are stored in a vector, there is no need to keep them as pointers; vector's emplace_back would work fine. – icaptan Jan 22 '17 at 22:01
  • @user31264: That's what I thought, so I blew up the picture and looked as closely as I could. And yes, it really is 1000, not 1024. OK, it might be 1000 +/- 4 or 5, but not 1024. – TonyK Jan 22 '17 at 22:12
  • "Something is happening at x = 1000" -- I may be seeing things, but is not *something* also happening at x = 256? These feel like library internal magically-numbered buffers to me. – LSerni Jan 22 '17 at 22:25
  • Mhhh, maybe this is related to the .size() you are calling in each loop condition; I'm not sure how well it gets optimized by g++. See this question on the matter: https://stackoverflow.com/questions/1242185/loop-condition-evaluation and also this one: https://stackoverflow.com/questions/5135291/is-using-string-length-in-loop-efficient – Lery Jan 22 '17 at 23:35
  • To get a clearer view, you may also want to set up a new test based on randomized inputs seeded by the current time, so it is dynamic, to be sure this is not something that is input-dependent or related to the way you provide it. – Lery Jan 22 '17 at 23:47
  • You have a typo: `char indexToNext = i;` should be `int indexToNext = i;`. Once `indexToNext` reaches `128`, the function starts accessing negative positions in `aString`. – Anton Jan 23 '17 at 01:36
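To illustrate Anton's point with a standalone snippet (ours, not from the thread): on the common platforms where char is signed and 8 bits wide, incrementing it past 127 wraps it to a negative value, so using it as an index then reads in front of the string's buffer.

#include <iostream>

int main()
{
    // char is signed on most mainstream ABIs, so going past 127 wraps
    // (implementation-defined) to -128.
    char indexToNext = 127;
    ++indexToNext;
    std::cout << static_cast<int>(indexToNext) << "\n"; // typically prints -128
}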

3 Answers

11

Most people's working hypothesis seems to be that there's some kind of magic number hard coded into the libraries that results in a phase transition in performance around 999-1000 (except for LSerni, who makes the prescient observation that there may be multiple magic numbers).

I'll try to systematically explore this and a few other hypotheses below (source code is available at the end of this answer).

I ran my code to see if I could duplicate your results on my Intel(R) Core(TM) i5 CPU M480, Linux 4.8.0-34-generic machine, using G++ 6.2.0-5ubuntu2 as my compiler with -O3 optimizations.

Sure enough, there's a magic drop from 999-1000 (and another one near 1600):

Data from one machine

Note that my trans-1000 dataset is not as clean as yours: this may be because I'm playing with a few other things in the background on my machine whereas you had a quieter testing environment.

My next question was: is this magic 1000 number stable between environments?

So I tried running the code on an Intel(R) Xeon(R) CPU E5-2680 v3, Linux 2.6.32-642.6.1.el6.x86_64 machine, using G++ 4.9.2. And, not surprisingly, the magic number was different, occurring at 975-976:

Data from another machine

This tells us that if there is a magic number, it changes between versions. That decreases my confidence in the magic-number theory for a few reasons: (a) it moves; (b) 1000 + 24 bytes of overhead is a good candidate for magic, while 975 + 49 bytes is less so; (c) the first environment has better software on a slower processor, yet it shows what I would consider worse performance: waiting until 1000 to speed things up. This seems like a regression.

I tried a different test: running the program with different random input data. That gives this result:

Data from multiple runs

The salient point in the above graph is that the 999-1000 drop is not so special. It looks like many of the drops before it: a slow decrease in speed followed by a sharp improvement. It's also worth noting that many of the previous drops do not align.

This suggested to me that this is an input-dependent behavior and that there's correlation between runs. Therefore, I wondered what would happen if I reduced the correlation between runs by randomizing their order. This gave:

Randomized order of plots

Something's still happening around 999-1000:

Randomized order of plots (zoom)

Let's zoom in even more:

Randomized order of plots (super zoom)

Running this on the faster computer with the older software gives a similar result:

Randomized order of plots on faster machine

Zoomed:

Randomized order of plots on faster machine (zoomed)

Since randomizing the order in which strings of different lengths are considered essentially eliminated the slow build-up between runs (the aforementioned correlation), this suggests that the phenomenon you're seeing requires some sort of global state. Therefore, C++ string/vector cannot be an explanation. Therefore, malloc, "the OS", or architectural constraints must be the explanation.

Note that when the order of lengths is randomized, there's a point at which the code runs slower rather than faster. In my mind this is consistent with some sort of cache size being exceeded, but the noise in the signal coupled with the very first plot in this post also suggest possible memory fragmentation. Therefore, I decided to restart the program before each run to ensure a fresh heap. That resulted in the following:

Randomized order of plots with a fresh heap

And now we see that there are no more breaks or jumps. This suggests that cache-size was not the issue, but, rather, that the observed behavior has something to do with the program's overall memory usage.

Another argument against a caching effect is as follows. Both machines have 32kB and 256kB L1 and L2 caches, so their cache performance should be similar. My slow machine has a 3,072kB L3 cache. If you assume a 4kB page per allocation, 1000 nodes gives 4,000kB allocated, which is close to the cache size. However, the fast machine has a 30,720kB L3 cache and shows a break at 975. If the phenomenon were a caching effect, you would expect the break, if anything, to come later. Therefore, I'm pretty sure caching is not at work here.

The only remaining culprit is malloc.

Why is this happening? I'm not sure. But, as a programmer, I don't much care, for the reasons that follow.

There's probably an explanation for this, but it's at a level that's too deep to change or really worry about. I could do something exotic to fix it, but that would require thinking about what's going on somewhere in its dark underbelly. We use higher-level languages like C++ specifically to avoid messing with those kinds of details unless we really have to.

And my results say we don't have to in this case. (a) The last graph tells us that any independent run of the code is likely to exhibit near-optimal behavior, (b) randomizing sequential runs can level performance, and (c) the loss in efficiency is on the order of a hundredth of a second, which is entirely acceptable unless you're processing massive amounts of data.

Source code follows. Note that the code changes your version's char indexToNext to int indexToNext, fixing possible integer overflow issues. Testing interjay's suggestion that we avoid making copies of the string actually resulted in worse performance.

#include <string>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <time.h>
#include <algorithm>

struct profiler
{
  std::string name;
  std::chrono::high_resolution_clock::time_point p;
  profiler(std::string const &n) :
      name(n), p(std::chrono::high_resolution_clock::now()) { }
  ~profiler()
  {
      using dura = std::chrono::duration<double>;
      auto d = std::chrono::high_resolution_clock::now() - p;
      std::cout //<< name << ": "
          << std::chrono::duration_cast<dura>(d).count()
          << std::endl;
  }
};

#define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)

struct node {
  char value = ' ';
  std::vector<node*> children;
  ~node(){
    for (node* child: children)
      delete child;
  }
};

int numberOfUniqueSubstrings(const std::string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        int indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode  = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}


int main(int argc, char **argv){
  const int MAX_LEN = 1300;

  if(argc==1){
    std::cerr<<"Syntax: "<<argv[0]<<"<SEED> [LENGTH]"<<std::endl;
    std::cerr<<"Seed of -1 implies all lengths should be explore and input randomized from time."<<std::endl;
    std::cerr<<"Positive seed sets the seed and explores a single input of LENGTH"<<std::endl;
    return -1;
  }

  int seed = std::stoi(argv[1]);

  if(seed==-1)
    srand(time(NULL));
  else
    srand(seed);

  //Generate a random string of the appropriate length
  std::string a;
  for(int fill=0;fill<MAX_LEN;fill++)
      a.push_back('a'+rand()%26);

  //Generate a list of lengths of strings to experiment with
  std::vector<int> lengths_to_try;
  if(seed==-1){
    for(int i=1;i<MAX_LEN;i++)
      lengths_to_try.push_back(i);
  } else {  
    lengths_to_try.push_back(std::stoi(argv[2]));
  }

  //Shuffle the order in which the lengths are tested (comment this line out to run them in increasing order)
  std::random_shuffle(lengths_to_try.begin(),lengths_to_try.end());

  for(auto len: lengths_to_try){
    std::string test(a.begin(),a.begin()+len);

    std::cout<<len<<" ";
    {
      PROFILE_BLOCK("Some time");
      node *n;
      int c = numberOfUniqueSubstrings(test,n);
      delete n;
    }
  }
}

substr is a "constant"

OP's original code included the following:

for (int i = 0; i < aString.size(); ++i)
{
  string tmp = aString.substr(i, aString.size());

The substr operation here takes O(n) time in the length of the string. In an answer below, it is argued that this O(n) operation results in the poor performance of OP's original code.

I disagree with this assessment. Due to caching and SIMD operations, CPUs can read and copy data in blocks of up to 64 bytes (or more!). Because of this, the costs of memory allocation can dominate the cost of copying the string. Thus, for OP's input sizes, the substr operation acts more like an expensive constant than an additional loop.

This can be demonstrated via testing by compiling the code with, e.g. g++ temp.cpp -O3 --std=c++14 -g and profiling with, e.g. sudo operf ./a.out -1. The resulting time-use profile looks like this:

25.24%  a.out    a.out                [.] _ZN4nodeD2Ev        #Node destruction                                                                           
24.77%  a.out    libc-2.24.so         [.] _int_malloc                                                                                    
13.93%  a.out    libc-2.24.so         [.] malloc_consolidate                                                                            
11.06%  a.out    libc-2.24.so         [.] _int_free                                                                                      
 7.39%  a.out    libc-2.24.so         [.] malloc                                                                                        
 5.62%  a.out    libc-2.24.so         [.] free                                                                                          
 3.92%  a.out    a.out                [.] _ZNSt6vectorIP4nodeSaIS1_EE19_M_emplace_back_auxIJRKS1_EEEvDpOT_                              
 2.68%  a.out    a.out                [.]
 8.07%  OTHER STUFF

From this it is evident that memory management dominates the run-time.
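For anyone who wants to probe the copy cost directly, here is a minimal sketch (ours, not part of the original answer) that times only the substr loop, isolated from the tree construction; whether that loop is negligible or dominant around n = 1000 is precisely what the comment threads below argue about:

#include <chrono>
#include <iostream>
#include <string>

int main()
{
    for (int n = 100; n <= 2000; n += 100)
    {
        std::string s(n, 'a');
        volatile std::size_t sink = 0; // keep the copies from being optimized away

        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i)
        {
            std::string tmp = s.substr(i, s.size()); // same call as in the question
            sink += tmp.size();
        }
        auto stop = std::chrono::high_resolution_clock::now();

        std::cout << n << " "
                  << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
                  << "\n";
    }
}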

Richard
  • I think everything you wrote - and it's a nice analysis, btw - is consistent with my theory that this is an artifact of malloc's free list maintenance. This could be further supported by avoiding malloc in the timing completely: pre-allocating a pool of nodes with character and child arrays of the max possible size rather than allocating on the fly. – Gene Jan 27 '17 at 01:44
  • Thanks, @Gene: I thought your answer got to the heart of the matter (and upvoted accordingly). I'd be nervous, though, about using results from runs without malloc, since that would simultaneously affect caching behavior. I'm not too familiar with profiling for this sort of effect, but that seems the best empirical route. – Richard Jan 27 '17 at 02:35
10
for (int i = 0; i < aString.size(); ++i)
{
    string tmp = aString.substr(i, aString.size());

This already makes your algorithm O(n^2) or worse. The call to substr creates an n/2-sized substring on average, so it takes O(n), and you call it n times.

It seems that you don't actually need the tmp string since you only read from it. Instead, read from the original string but change your indexing accordingly.

The for (int j = indexToNext; j < tmp.size(); ++j) loop will probably also give your algorithm O(n^2) total time (I say "probably" because it depends on the calculated value of indexToNext, but from testing with random strings it appears to hold true). It is run O(n) times, and each time will take up to O(n) iterations.
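As a comment below suggests, std::string_view (C++17) avoids the copy without changing the indexing. The sketch that follows is illustrative only (the function name is made up and the body deliberately mirrors the question's code), but substr on a string_view is O(1): just a pointer and a length.

#include <string_view>

// Sketch only (assumes C++17 and the node struct from the question).
int numberOfUniqueSubstringsView(std::string_view s, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < (int)s.size(); ++i)
    {
        std::string_view tmp = s.substr(i); // O(1): no allocation, no copy
        node* currentNode = root;
        int indexToNext = 0;                // int, avoiding the char overflow
        for (int j = 0; j < (int)currentNode->children.size(); ++j)
        {
            // The std::string version quietly relies on the '\0' terminator when
            // indexToNext reaches the end; a view has none, so check the length.
            if (indexToNext < (int)tmp.size() &&
                currentNode->children[j]->value == tmp[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;                     // restart the scan in the child, as in the original
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < (int)tmp.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = tmp[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}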

interjay
  • You should add a section about `std::string_view`, which will allow him to manipulate substrings without copying. – Guillaume Racicot Jan 22 '17 at 19:33
  • I edited the question, adding information about your recommendation. It does indeed improve the code, but it reveals, in my opinion, an even stranger performance... situation. – mjgalindo Jan 22 '17 at 19:39
  • I disagree with this analysis. It's more appropriate to view the call to `substr` as taking `O(1)` time since OP is working with smallish (~1000 byte) strings and _any_ read from memory is going to load nearby elements. Further, testing your suggestion by passing a constant string reference to the subroutine along with the length resulted in worse performance. – Richard Jan 24 '17 at 23:12
  • @Richard 1000 bytes is definitely large enough here to cause an issue. `substr` is O(k) where k is the size of the result. And you can clearly see how the original graph in OP's post was parabolic, whereas the graph in the edit after he fixed the issue I pointed out looks linear. I have no idea what you mean by "passing a constant string reference to the subroutine along with the length" as I suggested no such thing. By the way, saying that something is O(1) because it was tested on a small size shows that you don't understand big-O notation (which is concerned with asymptotic performance). – interjay Jan 25 '17 at 01:16
  • I'm not putting too much faith in OP's graphs: before they applied the fix you suggested they showed a 0-1400 x-range. Afterwards, they show a much larger x-range which looks more linear. The 0-1000 range could pass for parabolic on all the graphs. – Richard Jan 25 '17 at 03:17
  • The behavior of algorithms can change markedly based on sample size due to caching effects. I reflect your statement back to you: since OP is using short strings, the `O(k)` complexity you claim for `substr` does not apply. Examples 1 and 2 at [this site](http://igoro.com/archive/gallery-of-processor-cache-effects/) provide a brief explanation of the effect. I agree that you are saving OP time, but I think the savings are coming from reducing memory allocation and destruction, not from the trivial time it is taking to copy the string. – Richard Jan 25 '17 at 03:17
  • @Richard The CPU cache doesn't change big-O complexity, only the constant factor involved. And do you really think that copying 1000 bytes takes the same amount of time as copying 1 byte, considering that the cache line size is only 64 bytes? If you do, I suggest you test it. Although you'd probably choose to ignore those results just like you chose to ignore the graph which proves you wrong... – interjay Jan 25 '17 at 18:53
  • @interjay: Big-O is always with reference to a cost model. I think you're mistaken about which costs are important here. As such, I think you are misinterpreting what the graphs are saying. – Richard Jan 25 '17 at 19:14
  • @interjay: I profiled OP's code and, as I suggested above, memory _allocation_, not the `substr` copy, is the dominant cost. Removing this allocation from the inner loop is the reason for the increased efficiency. I've posted the testing methodology [here](http://pastebin.com/JZG8TV6d); if you disagree with it, I'd be happy to look at alternative results. – Richard Jan 25 '17 at 19:39
  • @Richard The last inner loop also appears to make the algorithm O(n^2), in addition to the `substr` call. Since memory allocation is more expensive than copying a single byte, this will obviously be what you see in the profiler. But that doesn't change the fact that `substr` also causes O(n^2) runtime. I suggest you refrain from correcting people about big-O analysis, as you do not appear to understand the subject. – interjay Jan 26 '17 at 14:27
  • @interjay: Your answer and previous comments imply that it is the _O(n)_ nature of the string copy that makes OP's algorithm slow. I agree that will be true for large _n_, but, as I've argued, and demonstrated, it is not the case for the input sizes OP depicts. You made a misdiagnosis here, and it's okay to admit that. – Richard Jan 26 '17 at 18:47
  • @Richard Sigh... The O(n) nature of string copy *does* make the algorithm slow (more precisely: makes it O(n^2)). As does the O(n) nature of the last inner loop. You haven't demonstrated anything that contradicts this - you just think you do because you don't know what O(n^2) means. Unlike you, I do admit mistakes, but I stand behind this answer. You seem to assume that I claimed that substr is taking the majority of the time here, but my answer does not claim that at all - it is purely your misunderstanding. – interjay Jan 26 '17 at 18:48
  • @interjay: Let me try to understand again, then. You claim that the cost of `substr` is `O(n)`, at least for large `n`. I don't disagree with this. I claim that, since the cost of allocation dominates for OP's depicted input sizes, that `substr` is effectively acting as an expensive constant in this case. Do you disagree? – Richard Jan 26 '17 at 19:18
  • @Richard `substr` is not acting as a constant but as a O(n) operation. It's just that there is another O(n) inner loop that has a larger constant factor. Either of these things separately is enough to make this algorithm O(n^2). – interjay Jan 26 '17 at 19:48
  • @interjay: I think my tests demonstrate otherwise, but, if you can provide profiling data showing that `string tmp = aString.substr(i, aString.size());` has a run-time dependency on `n` for OP's input sizes, I'll be happy to believe you. The code I provide in my tests should be a decent base to work from. – Richard Jan 26 '17 at 20:00
  • @Richard What I said in my last comment is easily provable mathematically. Your tests do not demonstrate anything at all about the asymptotic performance of substr - only about its performance relative to other parts of the algorithm, which is not relevant to my claims. If you want to test it, remove everything except the substr and you will still see it behaving quadratically. – interjay Jan 26 '17 at 20:06
  • @interjay: I've claimed that you can treat the `substr` as a constant for OP's input sizes precisely because its `O(n)` cost is negligible in comparison to other parts of the algorithm (specifically the allocation needed to hold the results of the `substr`). It sounds like you agree with me. If not, I'm happy to look at any tests you provide. – Richard Jan 26 '17 at 20:13
3

I'd suspect malloc more than string or vector. It's entirely possible that it treats < 1000 bytes and > 1000 bytes differently. Coalescing freed blocks can be expensive. It may refrain from trying to coalesce the larger blocks (i.e. by maintaining them in pools). But this is really just a guess. Why don't you try a profiler and get some real data? gprof is easy to use.

This article has some interesting details on glibc malloc. If that's what's under the hood of your program, the differences among bin types described there could be at work. Indeed chunks are freed to an "unsorted bin" that's occasionally reorganized. It's possible the spikes are these reorganizations to prevent growing the heap. If that theory is correct, then the smoothing out could be the result of growing the heap arena to a size where reorganization isn't as expensive.

But again, this is all conjecture that can be resolved by running the profiler to see where the time is going at <1000 vs >1000.
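Another way to test the malloc conjecture, which Gene himself suggests in a comment under Richard's answer, is to take malloc out of the timed region by handing out nodes from a single pre-reserved pool. A rough sketch of ours, not from any of the posts; note that the per-child vectors still allocate, so this only removes the node-by-node allocator traffic that dominates the profiles above:

#include <string>
#include <vector>

struct PoolNode {
    char value = ' ';
    std::vector<int> children;   // indices into the pool instead of raw pointers
};

struct NodePool {
    std::vector<PoolNode> nodes;
    explicit NodePool(std::size_t maxNodes) { nodes.reserve(maxNodes); }
    int make(char v)
    {
        PoolNode n;
        n.value = v;
        nodes.push_back(std::move(n));
        return (int)nodes.size() - 1;
    }
};

int numberOfUniqueSubstringsPooled(const std::string& s, NodePool& pool)
{
    const int root = pool.make(' ');
    int substrings = 0;
    for (int i = 0; i < (int)s.size(); ++i)
    {
        int current = root;
        int indexToNext = i;
        for (int j = 0; j < (int)pool.nodes[current].children.size(); ++j)
        {
            const int child = pool.nodes[current].children[j];
            if (pool.nodes[child].value == s[indexToNext])
            {
                current = child;
                j = -1;          // restart the scan, as in the original code
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < (int)s.size(); ++j)
        {
            const int newNode = pool.make(s[j]);
            pool.nodes[current].children.push_back(newNode);
            current = newNode;
            substrings++;
        }
    }
    return substrings;
}

// Usage: a string of length n needs at most n*(n+1)/2 + 1 nodes, so e.g.
//   NodePool pool(len * (len + 1) / 2 + 1);
//   int count = numberOfUniqueSubstringsPooled(test, pool);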

Gene
  • Just added the profiler results for 999 and 1001. It's the first time I've run gprof, so it may not be what was expected. I compiled with the flag `-pg` and then ran the analysis with `gprof -b a.out gmon.out > analysis.txt` (following a tutorial). – mjgalindo Jan 23 '17 at 08:14