8

Is there a well-known solution to the following problem?

  • You have a vector of a lot of strings

  • You happily populate it with a few hundred thousand strings, and it's fast

  • You manipulate your strings in arbitrary ways; life is good.

  • You're done with the vector; vector goes out of scope, and now you have to go grab some coffee and sit back while each string gets destroyed one-by-one.

Edit: Problem solved!

I just ran the code below on Linux on the same computer, and it was fine -- which led me to figure out the solution. It turned out to be something on my system -- something I'd caused myself a long time ago, but which I'd forgotten about.
Upon fixing the problem, the time decreased dramatically, to even better than with GCC!

It's a good puzzle though, so instead of posting the answer, I'll do something else:
I'm not allowed to place a bounty on this question right now, but if you think you know the cause, give it a shot. If it's correct, I'll accept it and give you a nice bounty. (Remind me if I forget to give you the bounty!)
If no one knows then I'll just post it myself after some time passes.

Sample code:

I used to be as skeptical as anybody, but now I guess people do have a point when they say the STL is slow!
This took 3.95 seconds on my laptop (the shuffling is critical):

#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <string>
#include <vector>

int main()
{
    using namespace std;
    srand((unsigned)time(NULL));
    clock_t start;
    {
        vector<string> v(400000);
        for (size_t i = 0; i < v.size(); i++)
        {
            v[i].resize(16 + rand() % 32);  // some variation
        }

        // shuffle
        for (size_t i = 0; i < (size_t)pow((double)v.size(), 1.15); i++)
        {
            size_t j = (size_t)rand() * ((size_t)RAND_MAX + 1) + rand();  // widen before multiplying to avoid int overflow
            swap(v[i % v.size()], v[j % v.size()]);
        }
        
        printf("Going out of scope...\n"); fflush(stdout);
        start = clock();
    }
    clock_t end = clock();
    printf("%lu ms\n", (unsigned long)((end - start) * 1000 / CLOCKS_PER_SEC));
    return 0;
}

It looks to me like this program is using some O(n²) algorithm internally, either in Visual C++ or in Windows. Not sure what's happening, but it's interesting...

user541686
  • Sounds like you already answered your question: `"grab some coffee"` - in all seriousness, is it really that bad? – Mysticial Jul 18 '12 at 01:01
  • @Mysticial: lol. It introduces a ~3-second pause in the release version of my app, which hurts the UI because the user doesn't expect it. – user541686 Jul 18 '12 at 01:02
  • @Mehrdad: How big are your strings? How much data is being allocated by the > 100k strings? Is there any alternative representation you can use that would be lighter than a string? – Mike Bailey Jul 18 '12 at 01:03
  • Woah... 3 seconds to deallocate a few 100k strings... That's pretty slow... More than a few thousand cycles per string. What memory allocator is this? – Mysticial Jul 18 '12 at 01:03
  • @MikeBantegui: It's a bunch of file names, so perhaps typically 32 characters or so? Up to 256 each. – user541686 Jul 18 '12 at 01:03
  • @Mehrdad: Is a struct containing a fixed size char array (say 256 chars) an option? You could bypass the destructor then. – Mike Bailey Jul 18 '12 at 01:04
  • @Mysticial: It's the standard allocator (VC++)... it's usually around ~1 second or so; 3 seconds was one of the extremes. – user541686 Jul 18 '12 at 01:04
  • @MikeBantegui: No, that uses too much memory for my application, since file names are typically pretty short. :\ – user541686 Jul 18 '12 at 01:05
  • What's the total memory consumption of the entire vector? – Mysticial Jul 18 '12 at 01:07
  • @Mehrdad: You can still do variable length allocation. Just allocate the char arrays in the structs from a memory pool. Leave the destructor empty and just delete[] the memory pool when you're finished using them. – Mike Bailey Jul 18 '12 at 01:07
  • @MikeBantegui: I can't manipulate the strings then. – user541686 Jul 18 '12 at 01:07
  • Hmm, destructing a vector with 100k strings takes a few 10s of **milli**seconds on my crappy old laptop, running VS2010 without optimisations. Something else must be up here... – Oliver Charlesworth Jul 18 '12 at 01:09
  • @Mysticial: I'd have to get back to you on that -- there's more data than the vector so I'd have to measure separately. It's probably on the order of 10 MB but I could be wrong. – user541686 Jul 18 '12 at 01:10
  • @OliCharlesworth: Several seconds definitely sounds fishy. I have a numerical simulation that allocates 12 GB of memory in a tree that only takes a few milliseconds to deallocate all the data. There's no reason for 3+ seconds in deallocation on anything short of a few gigabytes of data. – Mike Bailey Jul 18 '12 at 01:10
  • @OliCharlesworth, MikeBantegui: Hmm, interesting. Might be a threading issue (something would have to pause though, which I don't really do)... let me check, and I'll post here if I figure out I'm wrong. – user541686 Jul 18 '12 at 01:11
  • I want to point out that its being unexpected to the user can be remedied with a nice loading message (or whatever you want to call it) that notifies them that the pause is intentional. – chris Jul 18 '12 at 01:15
  • Is it possible that (parts of) your data structure have been paged out while you were processing the filenames? In that case, deallocation might be slow as things were paged back in. – clstrfsck Jul 18 '12 at 01:28
  • @MikeBantegui, OliCharlesworth: Seems like you guys are right -- I can't reproduce it with a simple example either. I'm guessing it might be a thread blocking on something occasionally then (no idea on what...), because when my app quits, I see a 1+ second pause, and when I break into it, it's in the vector's destructor, deleting all the strings... thanks anyway. – user541686 Jul 18 '12 at 01:29
  • @msandiford: Nah, I don't have my page file turned on. It's probably something else as the other guys mentioned. – user541686 Jul 18 '12 at 01:29
  • @Mehrdad: Uh no page file? That sounds like a no-no right there. – Mike Bailey Jul 18 '12 at 01:30
  • @MikeBantegui: I have 6 GB of RAM, and only 1.5 GB used. Been running fine for more than a year without a page file, so I don't see the point of it. – user541686 Jul 18 '12 at 01:30
  • People benchmarking string destruction... are you allocating them all in nice neat order? The "manipulate in arbitrary ways" in the question suggests much worse locality than a naive straightforward time trial. – Ben Voigt Jul 18 '12 at 01:43
  • Okay that's weird, I just tried *without* a debugger (in case it was the debugging heap), in release mode, and again, it took ~1.5 seconds for the program to exit... hmmm.. – user541686 Jul 18 '12 at 01:45
  • @Mysticial: Looks like it's around 16 MB for 330,405 strings. – user541686 Jul 18 '12 at 01:56
  • Okay, this is ridiculous -- I created an extra vector *outside* my dialog loop, assigned (copied) the dialog's data to the vector, and deleted the dialog, along with the data. It took ~1.5 seconds for them to be deleted, as usual. *Then* I timed how long the *outer vector* took to be deleted. It took about ***17 (seventeen)*** seconds (and I've turned off the Windows debug heap, and it's in release mode)! I'm guessing @BenVoigt might be onto something here, because that number doesn't make sense otherwise. Let me see if I can somehow trim this down and post it (it's a toughie). – user541686 Jul 18 '12 at 02:16
  • For what it's worth, I just tried the above^ using the Windows 2003 DDK STL instead of VS 2008. The difference? It was about 17 seconds faster... (and `_SECURE_SCL` is indeed zero). – user541686 Jul 18 '12 at 02:20
  • @BenVoigt: You're 100% right -- it's a caching issue; see my edit. – user541686 Jul 18 '12 at 03:26
  • This prints `60000` without and `50000` with optimization (`-O3`) on my Ubuntu machine when compiled with the current GNU C++ compiler (g++). But the compiler has shown two warnings, too. (Integer overflow in expression + printf format string error -- probably meaningless for the issues you are experiencing.) – krlmlr Jul 18 '12 at 03:31
  • @Mehrdad: No, actually, I guess it's 60 milliseconds. This is determined by the `CLOCKS_PER_SEC` macro, I'm not sure what it is on my system. – krlmlr Jul 18 '12 at 03:33
  • @user946850: Oh I thought you meant seconds! Sorry, I fixed that part of the code. 60 milliseconds? o.O I guess GCC just beat the heck out of Visual C++... I wonder what makes the difference? – user541686 Jul 18 '12 at 03:34
  • Oh WTF... I run it within Visual Studio and it's `4.491` seconds. I run it by double-clicking on the .exe and it's `0.035` seconds. I don't even know where to begin debugging this one... – Mysticial Jul 18 '12 at 03:35
  • @Mysticial: Lol wtf! I'm just double-clicking and getting that result... (maybe try setting the `_NO_DEBUG_HEAP` environmental variable to `1` for VS? It might be the debugging heap in Windows.) – user541686 Jul 18 '12 at 03:36
  • @Mysticial: What happens if you hit Ctrl+F5 in Visual Studio? – krlmlr Jul 18 '12 at 03:44
  • @user946850 `4.975` seconds, same as clicking on the green arrow. – Mysticial Jul 18 '12 at 03:46
  • @Mysticial: I guess now we see I wasn't kidding when I said it takes this long... :P – user541686 Jul 18 '12 at 03:46
  • Heap coalescing may well be expensive in proportion to the number of free blocks (and the coalescing algorithm may run in proportion to the number of blocks freed). Hence O(n^2). – Ben Voigt Jul 18 '12 at 03:47
  • @BenVoigt: So you're thinking GCC/Linux don't do that here (or maybe haven't started to do that *yet*)? – user541686 Jul 18 '12 at 03:48
  • @BenVoigt That's what I'm starting to suspect too. The randomized order of the deallocating is leaving tons of holes which the memory allocator may be having trouble handling. But only with the debugging heap? (have yet to confirm if it is indeed the debugging heap...) – Mysticial Jul 18 '12 at 03:49
  • Not all heap managers are created equal, and Linux is a lot more aggressive about using newer faster ones, where MS cares more about backward compatibility. – Ben Voigt Jul 18 '12 at 03:49
  • Looks like I found your answer: http://stackoverflow.com/questions/1060337/why-does-my-stl-code-run-so-slowly-when-i-have-the-debugger-ide-attached – Mysticial Jul 18 '12 at 03:52
  • @Mysticial: lol, does the "`_NO_DEBUG_HEAP`" ring a bell? XD – user541686 Jul 18 '12 at 03:53
  • No, I had to google it and it led me to that SO question. I don't do enough allocation-intensive things to have encountered this before. It's new to me. Yeah... I've hit denormals, branch mispredictions, data-alignment, but not this... yeah. Okay... – Mysticial Jul 18 '12 at 03:56
  • I just tested MinGW-GCC 4.4.5 and I got the same timing as VC++ (within a few percent). I guess it's either the OS or the CPU... – user541686 Jul 18 '12 at 05:57

3 Answers

8

Use a custom allocator with en-masse deallocation.

Serj-Tm
5

Okay, since nobody figured it out...

It was because heap tail checking had been turned on on my system. Once I turned it off, the code finished quickly.

user541686
  • Nice. How did you figure this out? – Viet Jul 27 '12 at 07:24
  • @Viet: I tried it on Linux, and it was fine, so I figured it's gotta be something low-level, heap-related in Windows. Then it hit me to check Global Flags... and when I saw it enabled, I remembered that I'd enabled it about a year ago, since it didn't seem to have any adverse effects at the time. Turned out I was wrong. :P – user541686 Jul 28 '12 at 23:35
0

Why don't you create the vector itself dynamically, so you can manage it with a reference-counting smart pointer? Then you can ensure that the thread holding the last reference to it isn't the UI thread, so the UI thread isn't the one doing the work when it goes out of scope.

You could even lower the priority of the thread that does the destruction so it doesn't affect the rest of your threads badly - the scheduler will make sure the UI thread runs ahead of the lower-priority one.

Note: I've never tried that, but I don't see why it wouldn't work - but spend time on it at your own risk! :)

John Humphreys