Fast way to "backup" (copy) large STL vector of custom objects

Question

I'm using a STL vector to store a large (~ 10^6) number of custom objects (sizeof() gives 368 bytes for one of these objects). The structure of my program requires me to make frequent backup copies of this vector as changes made in a certain step might need to be unrolled under certain conditions. Roughly this looks something like

std::vector<myClass> vecA( largeNumber );
std::vector<myClass> vecB;

do
{
  vecB = vecA;

  //do lots of stuff to vecA

  if ( restoreBackup ) { vecA = vecB; }

} while (someCondition)

Doing some profiling, the copy operation vecB = vecA is actually a considerable bottle neck. I really need to speed up this part. Which strategies would you suggest?

Remarks:

For certain reasons I would like to keep the STL vector structure.
I know that most of the times only a small fraction of the elements in vecA gets updated. So I could try to maintain an extra list of these elements and then only copy these elements back and forth by looping over this list. Would that be an adequate strategy?
The question is certainly related to this question, however the solutions proposed there don't really help.

score 2 · Accepted Answer · answered Apr 20 '11 at 09:27

2

Hm, this is not an trivial thing and I don't think that you can avoid this copying, if you really need to use only (and one) std::vector.

The best solution would be to use Persistant Data Structure.

But this would be very helpful, if you need to store more than one version of your container.

If you need just the previous version..
The easies thing, that comes to my mind and it seems, that will work for you, is to backup only the elements, you'll change. You could use a std::list (or std::vector ) with std::pair< int, myClass >. So, the first of the pair would be the index of the element, you're changing and the second - the backed up version. So, in the end, if you need to restore the backup, you'll just to go through this container and "revert" the old data.

The situation with deleting/adding elements will be more complicated, but not unsolvable. You'll have to have more containers - one for deleted elements, one for added elements, one for changed elements and one for the sequence of the changes. So, this will give you the opportunity to perform "undo" steps.

Well, this sounds more complicated and kinda hard to implement (because of all possible cases), but will increase the performance (will reduce the copies).

Other thing, that comes to my mind - you could check the container size first. And if it's small, make a whole copy. If it's big - perform the "selective backup" or whatever this is called.

Note, that this is not the best solution (I guess), but just an idea, I got right now (I have never needed such thing)

Hope that helps, even-thought it sounds complicated and hard to implement.

answered Apr 20 '11 at 09:27

Kiril Kirov

37,467
22
115
187

The size of the vectors is actually fixed within the while-loop (I forgot to mention this). So your suggestion using a container together with a std::pair actually sounds quite appealing and somewhat more elegant than my "additional index list" idea. I'll give it a shot. – janitor048 Apr 20 '11 at 09:34
Well, are you going to make changes as `delete` or `insert`? If not, I think that the `std::list< std::pair <> >` will help you (I'd suggest `std::list` as you'll need fast `insert` and `delete` operations from the list and you'll always iterate through the **whole** container(if backup is needed); also, you'll avoid any reallocations, what `std::vector` could make and slow your application down) – Kiril Kirov Apr 20 '11 at 09:36
No, within the loop there is no `delete` or `insert` going on. So your solution should work. And yes, `std::list` would clearly be more appropriate than `std::vector` in this case. I'll report... – janitor048 Apr 20 '11 at 09:40
2

if you don't have frequent deletes and inserts, in the middle of the data structure, then `std::list` is a *horrible* idea. It takes an eternity to iterate over, due to the constant cache misses and pointer indirections, compared to `std::vector`. Even with the occasional reallocation, a `std::vector` is typically faster. Iterating over a linked list looks tolerable on paper (because the overhead of jumping to the next node is constant), but it is *far* more expensive than with a vector. – jalf Apr 20 '11 at 09:49
@jalf - I'm very surprised by that. If the container is big, and a lot of things should be backed up, there will be a lot of reallocations. I agree, that iterating through `list` is slower that through `vector`, but it depends on the situation. I'm not sure if 20 reallocations (for example) will be faster that not reallocations and just **one** iteration through a list. Anyway, as I see, janitor048 is familiarized with profiling and some tests would be helpful here, I guess. – Kiril Kirov Apr 20 '11 at 09:55
@Kiril: but 20 reallocations can be reduced to 1 pretty easily. It just requires him to have a rough idea of the number of elements to be stored (and it sounds like the size is known). But yeah, need to profile to be sure. – jalf Apr 20 '11 at 10:04
@jalf - yep, you're right, that if there's info about the number of elements or at least to know what part of the container would need backup, `.reserve` would be very helpful and `std::vector` would be faster. But it depends and in the most cases, such info is not available. But right, I'd vote for tests, too (: – Kiril Kirov Apr 20 '11 at 10:47
@jalf, @Kiril: Ok, I've done some simple unit tests. First of all: The partial backup scheme really saves lots of time (will in reality of course depend on the ratio of # / #). Thanks! Using `std::list` as backup container instead of `std::vector` is roughly 30% faster in my tests. Oddly enough,this is also the case when I reserve the vector capacity beforehand. Moreover there is a significant difference in the distribution of resource usage. `vector`: _User time: 128.5, System time: 21.85_ `list`: _User time: 95.6, System time: 0.18_ (measured by gnu time) – janitor048 Apr 20 '11 at 11:50
@jalf, @Kiril: While I know the total size of my vector beforehand, the number of items that are actually changed (and thus require backup) varies a lot. I could try to estimate it but this would be very rough and for some reason `list` appears to be faster anyhow (see above). It's a physics simulation program btw. – janitor048 Apr 20 '11 at 11:54
@janitor048 - I'm really glad to hear that (: Good luck ;) – Kiril Kirov Apr 20 '11 at 12:03
@janitor048: you are compiling with optimizations enabled, right? (And if you're on VS2005 or 2008, with `SECURE_SCL=0`) – jalf Apr 20 '11 at 12:04
@jalf: Yes, optimization should be enabled. I'm using cmake 2.8 with DCMAKE_BUILD_TYPE=Release together with g++ 4.4.3 – janitor048 Apr 20 '11 at 12:07

score 0 · Answer 2 · answered Apr 20 '11 at 09:47

An alternative to Kiril's suggestions are to make full copies, but swap (or move) in the "restore backup" phase.

That is, replace

if ( restoreBackup ) { vecA = vecB; }

with

if ( restoreBackup ) { std::swap(vecA, vecB); }

Then the "restore backup" part becomes essentially free, and you only have the (full) backup itself to worry about. Of course, this makes any "partial backup" strategy impossible.

It sounds like some kind of "partial" backup scheme would be more efficient, but it never hurts to mention an alternative. :)

Yes, you're right. I've already considered using swap for the restore part. However, in my specific case the backup needs to be restored only in a small fraction of all iterations. Thus the critical part really is the creation of the backup copy.. — janitor048, Apr 20 '11 at 10:00

score 0 · Answer 3 · answered Apr 20 '11 at 15:14

0

If myClass is a POD, and if both vectors are already the appropriate size, you could use memcpy:

memcpy(&dstVec[0], &srcVec[0], sizeof(myClass)*srcVec.size());

answered Apr 20 '11 at 15:14

Robᵩ

163,533
20
239
308

`myClass` holds instances of other classes and also contains some none trivial STL containers itself. So I'm afraid `memcpy` won't work in my specific case. – janitor048 Apr 20 '11 at 22:15

score 0 · Answer 4 · answered Apr 20 '11 at 15:39

I know that you want to keep your existing std::vector. But this problem sounds like it was made for STM (Software Transactional Memory). It might be worth looking into.

As an alternative, perhaps you can change the code that modifies your vector so that instead of modifying it directly, it produces a list of changes to be made. After the list of changes is approved it can then be applied to the vector.

Another way to do that is to make the changes and build an undo list that will restore the vector to its original form.

All of this is a lot more complicated than making copies of the entire vector, but repeatedly copying megabytes back and forth is going to be slow with no way around it.

being more like a "semi-amateur" I must admit that I've not stumbled across STM before. But it sure sounds interesting, I'll read up more to see if it might be worth spending some time on for my projects. Triggered by the discussions above I've already started to restructure my code such that only actual changes are tracked rather than doing full copies (your undo list version, since accepting the changes is the regular and more frequent case). SO is just great for getting suggestions and ideas. Thanks to all! — janitor048, Apr 20 '11 at 22:28

Fast way to "backup" (copy) large STL vector of custom objects

4 Answers4