Which data structure is more efficient to store visited node count while solving graph problems in python?

Question

I have been solving graph problems in Python and using a set() to store already visited nodes to prevent getting stuck in a loop. However, I have seen people who code in C++/Java use vectors or arrays to store visited counts. I am wondering which approach is more efficient in Python or in general for time and space complexity.

For instance, considering we already know the index of the node:

visited = [0] * V     # space = O(V)
if not visited[node]  # takes Time = O(1)

visited = set()          # space = O(V) worst case
if node not in visited:  # takes O(1)

But according to this answer Time complexity of python set operations? sometimes set() complexity can be O(N). So, is it better to use an array?

if you already know the index then you should use an list or dictionary. sets don't have an indexing mechanism so you would lose all the benefits of having that information by using a set. — Alexander, Jul 27 '23 at 05:47
May be [dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) ? It should be O(1) complexity — Eugene Mikhalev, Jul 27 '23 at 07:04

score 0 · Answer 1 · answered Jul 27 '23 at 06:33

List in python is a good replacement for an array in C and list is even better than that of an array to store data in it. Consider using list instead of set because I have seen how versatile it is to use a list in python because list is mutable and for your code to keep on adding visited elements of the graph, using list is a good option I guess. Anyways, happy coding!

trincot · Answer 2 · 2023-07-27T14:08:58.827

You cannot always choose between the two.

If node is an unsigned integer with a small range, then using them as a list index will be the better choice: in that case you don't need the logic of a hashmap (the underlying technology for a set), as there is no collision management needed: the index uniquely identifies the entry in the list. Once you have allocated the list (which is O()), the relevant operations of getting and setting are all O(1).

If node does not fit that condition, and node cannot be mapped to such a small unsigned integer (with an injective mapping that has O(1) running time), then the choice for a list is fading away. node cannot serve as index anymore. Of course, you could still append node (as value) to a dynamic list, but then the in operation degrades to O() complexity. That's not what you want.

In such cases the use of a set is the right choice. For example, if you use an OOP approach, then node will be the instance of some Node class, having attributes for its key, label, and neighbors. Unless you have a way to map those node objects to small unsigned integers, they cannot serve as list indices, and set will be the logical option to go for.

Concerning the complexity of operations on a set: in practical cases you may assume O(1) amortized time complexity for getting and setting. The cases where collisions would degrade the complexity to O() are rare.

"node cannot be mapped to a small unsigned integer" When would this happen? — ravenspoint, Jul 27 '23 at 13:53
@ravenspoint, they could be mapped when they are integers that are multiples of 1000, or integers that are consecutive but start at one million, or objects that have an attribute that can serve as index, or ...etc. Of course, you could always map them (in constant time) even if none of this is applicable, by ... some hashing method. But that is what a set provides. — trincot, Jul 27 '23 at 13:57
AFAIK all those things you mention can be easily and quickly mapped to an integer smaller or equal to the vertex count. — ravenspoint, Jul 27 '23 at 13:59
That's what I mean? Maybe I understood your first comment in the opposite sense. — trincot, Jul 27 '23 at 13:59

ravenspoint · Answer 3 · 2023-07-27T14:57:24.743

Throwing around big O values is fun and easy. But at some point it is necessary to sit down and write some code and so you can then do timing tests on the actual implementations.

Code to compare C++ implementations of tracking visited vertices using a vector and a set.

#include <string>
#include <fstream>
#include <sstream>
#include <iostream>
#include <vector>
#include <set>
#include <algorithm>
#include "cRunWatch.h"

const int vertex_count =  1000000;
const int visited_count =  100000;

std::vector<bool> vVisited( vertex_count,false);
std::set<int> setVisited;

int randomVertex()
{
     raven::set::cRunWatch aWatcher("randomVertex");
    return rand()%(vertex_count-1);
}

int useVector()
{
    raven::set::cRunWatch aWatcher("useVector");

    // Visit some vertices;
    for( int k = 0; k < visited_count; k ++ )
    {
        vVisited[ randomVertex() ] = true;
    }

    // check if some vertices have been visited
    int total_visits_found = 0;
    for( int k = 0; k < visited_count; k ++ )
        if( vVisited[randomVertex()] )
            total_visits_found++;

    return total_visits_found;
}


int useSet()
{
    raven::set::cRunWatch aWatcher("useSet");

    // Visit some vertices;
    for( int k = 0; k < visited_count; k ++ )
    {
        setVisited.insert( randomVertex() );
    }

    // check if some vertices have been visited
    int total_visits_found = 0;
    for( int k = 0; k < visited_count; k ++ )
         if( setVisited.count(randomVertex() ) )
            total_visits_found++;

    return total_visits_found;
}
main()
{
    raven::set::cRunWatch::Start();

     /* initialize random seed: */
  srand (time(NULL));

    useVector();
    useSet();

    raven::set::cRunWatch::Report();

    return 0;
}

Results:

raven::set::cRunWatch code timing profile
Calls           Mean (secs)     Total           Scope
       1        0.154407        0.154407        useSet
       1        0.0599845       0.0599845       useVector
  400000        3.1321e-08      0.0125284       randomVertex

Note: these results are consistent on multiple runs with different random seeds

Conclusion:

The vector implementation is about three times faster than using a set.

Which data structure is more efficient to store visited node count while solving graph problems in python?

3 Answers3