
So I had a programming contest problem which involved running a lot of DFS on different graphs.

At first I represented my graph (adjacency-list representation) as a vector of sets:

vector< set<int> > graph;

Also, to initialize a graph according to the given number of nodes each time, I used an empty set:

set<int> tmpSet;

and I initialized the graph like this:

for (int j = 0; j < N; j++) // N is the number of nodes in the graph
    graph.push_back(tmpSet);

and I used

graph.clear();

to empty the graph each time.

To insert edges afterwards, I used the insert function of std::set:

// Insert an undirected edge between u and v
graph[u].insert(v);
graph[v].insert(u);

The result was that the program consumed too much memory and was too slow to pass the tests. The same thing happened with std::list, even though its push_back is a constant-time operation. When I switched to std::vector, memory consumption became minimal and I passed the tests in 3 seconds, while the std::set and std::list versions couldn't pass them even in 20 seconds.

My guess is that it has something to do with freeing the space of the inner set and list, but how come vector behaves differently?

So my question is: can anyone explain why this was happening, so I can better understand how STL containers behave in situations like this, where you have a container inside another container?

EDIT: Some extra info: the number of nodes was about N = 3000 and the number of tests over 1000. That means I had to create over 1000 graphs, all held in the variable 'graph'. Also, I'm aware that set inserts in O(log n) time while vector and list insert in O(1), so I understand why set would take somewhat longer than vector. But then why did std::list fail as well? Also, let me mention that the set and list versions finished with 100 MB of memory used, while the vector version finished with 3 MB.
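One pattern worth noting in this situation (a hedged sketch, not the poster's actual code): instead of clearing the outer vector between test cases, which destroys all the inner containers and forces them to be reallocated, you can size the outer vector once and clear only the inner adjacency lists. A vector's clear() keeps its capacity, so later tests reuse the same heap blocks.

```cpp
#include <vector>

// Illustrative reuse pattern: the outer vector is resized once to the
// largest N seen so far, and only the inner adjacency lists are cleared
// between test cases. clear() on a vector retains its capacity, so
// repeated tests reuse the existing allocations.
std::vector<std::vector<int>> graph;

void reset_graph(int n) {
    if (graph.size() < static_cast<std::size_t>(n))
        graph.resize(n);
    for (int i = 0; i < n; ++i)
        graph[i].clear(); // capacity is retained
}

void add_edge(int u, int v) {
    graph[u].push_back(v);
    graph[v].push_back(u);
}
```

This avoids the churn of allocating and freeing thousands of inner containers across 1000+ test cases.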

OK, final edit. Here's my code to show exactly how I'm using the graph (the list version). Nowhere else in the program does any freeing of memory or modification of the graph data occur.

#include <cstdio>
#include <list>
#include <vector>
using namespace std;

vector< list<int> > graph;
list<int> tmpList;
int T; // num of test cases
int N; // num of nodes
int M; // num of edges

void dfs(); // defined elsewhere in the program

int main()
{
    int u, v;
    scanf("%d", &T); // read number of test cases
    for (int i = 0; i < T; i++) {

        scanf("%d %d", &N, &M); // read number of nodes and edges
        for (int j = 0; j < N; j++)
            graph.push_back(tmpList);

        for (int j = 0; j < M; j++) {
            scanf("%d %d", &u, &v); // read edge between u and v
            graph[u].push_back(v);
            graph[v].push_back(u);
        }
        dfs();
        graph.clear();
    }
}
Paris P
  • What is the exhibited behavior when you *remove* the test logic *entirely* and simply load your containers, then release them? Is there a remote chance for an SSCCE that exhibits your observed behavior to be provided? Did you consider a swap/clear idiom for your "clear"? And again, how were the containers *accessed/enumerated* during your "tests"? If you want a concrete answer to this that is not sheer guesswork/speculation more info will likely need be provided. – WhozCraig Apr 10 '13 at 12:47

3 Answers

2

When you use std::set to hold the adjacent node numbers, insertion and lookup take logarithmic time, which is slow. But with std::vector, insertion (push_back) and element access take constant time, hence the difference in running time. So use std::vector when you don't need to search for elements in the set, and std::set otherwise.
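The trade-off can be sketched as follows (illustrative helper names, not part of the question's code): a set answers "is v a neighbor?" in O(log n) with count, while an unsorted vector needs a linear scan with std::find.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Membership test in each representation.
bool adjacent_set(const std::set<int>& nbrs, int v) {
    return nbrs.count(v) > 0; // O(log n) tree search
}

bool adjacent_vec(const std::vector<int>& nbrs, int v) {
    return std::find(nbrs.begin(), nbrs.end(), v) != nbrs.end(); // O(n) scan
}
```

For a plain DFS, which only iterates over each adjacency list, the membership test is never needed, so the vector's cheaper insertion and iteration win.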

The difference between std::list and std::vector may come from the clear function: for list it is linear in the number of elements (every node must be freed), while for a vector of ints it is effectively constant, since no per-element destructors need to run and the buffer is kept.

Ashot
  • I would understand if that were the case, but the same thing happened with std::list, which offers push_back in O(1) time. – Paris P Apr 10 '13 at 12:06
1

A set is ordered. It is guaranteed to remain in a specific ordering, according to a functor that you provide. No matter what elements you add or remove (unless you add a duplicate, which is not allowed in a set), it will always be ordered.

A vector has exactly and only the ordering you explicitly give it. Items in a vector are where you put them. If you put them in out of order, then they're out of order; you now need to sort the container to put them back in order.

Admittedly, set has relatively limited use. With proper discipline, one could insert items into a vector and keep it ordered. However, if you are constantly inserting and removing items from the container, vector will run into many issues. It will be doing a lot of copying/moving of elements and so forth, since it is effectively just an array.

The time it takes to insert an item into the middle of a vector (keeping it ordered) is proportional to the number of items already in the vector. The time it takes to insert an item into a set is proportional to the log of the number of items. If the number of items is large, that's a huge difference: log₂(100,000) is about 17, versus up to 100,000 element moves for the vector. The same goes for removal.
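As a sketch of that cost (the helper below is illustrative, not from the answer): keeping a vector sorted on insert finds the position in O(log n) with upper_bound, but the insert itself must shift every later element, which is the linear cost described above; a set pays only the logarithmic search plus a constant number of tree-rebalancing steps.

```cpp
#include <algorithm>
#include <vector>

// Insert x into a sorted vector, keeping it sorted. The position is
// found in O(log n), but vector::insert shifts all later elements,
// making the whole operation O(n).
void insert_sorted(std::vector<int>& v, int x) {
    v.insert(std::upper_bound(v.begin(), v.end(), x), x);
}
```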

However, if you do all of your insertions at once, at initialization time, then there's no problem. You can insert everything into the vector, sort it (paying that price once), and then use standard algorithms for sorted vectors to find elements and iterate over the sorted list. And while iteration over the elements of a set isn't exactly slow, iterating over a vector is faster.
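The "sort once, search many" pattern described above can be sketched like this (helper names are illustrative): build the vector with cheap push_backs, sort it once, then answer membership queries with binary search in O(log n).

```cpp
#include <algorithm>
#include <vector>

// Build once, sort once, then query many times.
std::vector<int> build_sorted(std::vector<int> items) {
    std::sort(items.begin(), items.end()); // one-time O(n log n) cost
    return items;
}

bool contains(const std::vector<int>& sorted, int x) {
    return std::binary_search(sorted.begin(), sorted.end(), x); // O(log n)
}
```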

So there are cases where a sorted vector beats a set. That being said, you really shouldn't bother with the expense of this kind of optimization unless you know that it is necessary. So use a set unless you have experience with the kind of system you're writing (and thus know that you need that performance) or have profiling data in hand that tells you that you need a vector and not a set.

Sooraj Chandu
0

As usual the best answer to performance questions is to profile both implementations for your use case and see which is faster.

In general, if you insert into the data structure anywhere other than at the end, vector may be slower. Otherwise, vector is expected to outperform list in most cases, if only because of data locality: when two elements that are adjacent in the data set are also adjacent in memory, the next element is likely already in the processor's cache, so iterating doesn't incur a cache miss for every element.

Also keep in mind that the space overhead for a vector is constant (typically three pointers), while a list pays its overhead per element (at least two link pointers per node); this also reduces the number of full elements (data plus overhead) that can reside in the cache at any one time.
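The overhead argument can be made concrete (exact sizes are implementation-defined, so the figures below are typical rather than guaranteed): sizeof(container) is the fixed header cost paid once per container, and vector elements are guaranteed to be stored contiguously, with no per-element header at all.

```cpp
#include <cstddef>
#include <vector>

// The vector's fixed header cost, paid once per container; commonly
// three pointers (begin, end, end-of-capacity), but implementation-defined.
const std::size_t vector_header = sizeof(std::vector<int>);

// Contiguity is guaranteed by the standard: element i lives exactly
// i slots past element 0, with no per-element bookkeeping in between.
bool contiguous_demo() {
    std::vector<int> v = {1, 2, 3};
    return &v[1] == &v[0] + 1 && &v[2] == &v[0] + 2;
}
```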

Sooraj Chandu