Efficient storage, lookup and manipulation of a large undirected graph

Question

I have an undirected graph G with about 5000 nodes. Any pair of nodes may be connected by an edge. Length, direction, or other features of the edge are irrelevant, and there can be at most one edge between two points, so the relationships between nodes are binary. Thus, there are 12497500 total potential edges.

Each node is identified by a string name, not a number.

I would like to store such a graph (loaded as input data into my program) but I'm not sure what kind of data structure is best.

I will need to lookup whether a given pair of nodes are connected or not, many many times, so lookup performance is probably the primary concern.
Performance cost of adding and removing elements is not a concern.
I would also like to keep the syntax simple and elegant if possible, to reduce the possibility of introducing bugs and make debugging easier.

Two possibilities:

bool[numNodes, numNodes] and a Dictionary<string, int> to match each node name to an index. Pros: Simple, fast lookup. Cons: Can't easily remove or add nodes (would have to add/remove rows/columns), redundant (have to be careful about g[n1, n2] vs g[n2, n1]), clumsy syntax because I have to go through the HashMap every time.
HashSet<HashSet<string>>. Pros: Intuitive, nodes directly identified by strings, easy to "add/remove nodes" because only edges are stored and nodes themselves are implicit. Cons: Possible to enter garbage input (edges that "connect" three nodes because the set has three members).

Regarding the second option, I am also unclear about a number of things:

Is it going to take much more memory than an array of bool?
Are two .NET sets equivalent to mathematical sets in the sense that they are equal if and only if they have the exact same members (as opposed to being distinguished by capacity or order of elements and so on) for purposes of HashSet membership? (ie is querying with outerSets.Contains(new HashSet<string>{"node1", "node2"}) actually going to work?)
Is lookup going to take much longer than an array of bool?

"at most there can be one edge". Does that mean that for any pair of nodes they can only be connected by a single edge? — spender, Aug 08 '12 at 23:54
@spender Yes, for any given pair of nodes there can be one edge or none. If all of the 5000 nodes are connected, there will be 12497500 edges, 4999 connecting to each node. — Superbest, Aug 08 '12 at 23:57
Sure. the number of handshakes between n people. HashTables are pretty fast for this kind of thing, but significantly less so than an array. Abstract your operations to an interface rather than dealing with the edge finding/manipulation directly in client code, then if your 1st implementation seems creaky, you can swap it out for a completely different backend without having to rewrite the code that makes used of it. If speed is critical, and memory is plentiful then array seems like it would be best option, but tenfold more nodes would mean 100fold more memory. Test, test then test some more. — spender, Aug 09 '12 at 00:05
How long do you expect the node string identifiers to be, on average? If they are not too long, and since edges are undirected then it may not be too expensive to normalize your (node1, node2) coordinate into a single key value that can be used as a subsequent lookup in a single hashtable. eg, based on comparison of strings: `string key = (node1 < node2) ? node1+node2 : node2+node1;` — Monroe Thomas, Aug 09 '12 at 00:15
@MonroeThomas In my case, I don't see the names exceeding 10 characters, certainly not 20. The majority (and possibly all) are 7 alphanumeric (capital letters and numbers) characters long. — Superbest, Aug 09 '12 at 00:17
`outerSets.Contains(new HashSet{"node1", "node2"})` ain't gonna work btw. Things would get messy if it did, because it would mean that mutating an item in the set might break the contract that items are unique. HashSets are compared by reference only, not the values contained within. `.SetEquals` in HashSet would be a better way to compare. — spender, Aug 09 '12 at 00:24
Even without normalizing and string concatenation, one could create a datastructure with the correct equality and hashing methods as a lookup in a single hashtable. It's important that the members involved in equality and hashing are immutable though. — spender, Aug 09 '12 at 00:28

score 1 · Answer 1 · edited May 23 '17 at 12:23

1

The C5 Collections Library has some useful stuff regarding graphs

http://www.itu.dk/research/c5/

This question, Most efficient implementation for a complete undirected graph, looks useful as well.

SortedDictionary is under the hood, a height balanced red-black tree, so lookups are O(log n).

edited May 23 '17 at 12:23

Community

1
1

answered Aug 09 '12 at 01:13

Nicholas Carey

71,308
16
93
135

Monroe Thomas · Accepted Answer · 2012-08-09T02:08:04.287

I was curious about using string concatenation vs. a Tuple when generating a key representing an edge in a hashtable in order to approach O(1) lookup performance. Two possibilities here for handling the undirected edge requirement:

Normalize the key so that it is the same no matter which node is specified first in the description of the edge. In my test, I simply choose to take the node with lowest ordinal comparison value as being the first component in the key.
Make two entries in the hashtable, one for each direction of the edge.

A critical assumption here is that the string node identifiers are not very long, so that key normalization is inexpensive relative to the lookup.

The string concatenation and Tuple versions with key normalization seems to work about the same: was completing about 2 million random lookups in about 3 seconds in a VirtualBox VM in release mode.

To see if the key normalization was swamping the effect of the lookup operation, a third implementation does no key normalization, but maintains symmetric entries with respect to both possible directions of an edge. This seems to be about 30-40% slower on lookups, which was slightly unexpected (to me). Perhaps the underlying hash table buckets have higher average occupancy due to having twice the number of elements, requiring longer linear searches within each hash bucket (on average)?

interface IEdgeCollection
{
    bool AddEdge(string node1, string node2);
    bool ContainsEdge(string node1, string node2);
    bool RemoveEdge(string node1, string node2);
}

class EdgeSet1 : IEdgeCollection
{
    private HashSet<string> _edges = new HashSet<string>();

    private static string MakeEdgeKey(string node1, string node2)
    {
        return StringComparer.Ordinal.Compare(node1, node2) < 0 ? node1 + node2 : node2 + node1;
    }

    public bool AddEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Add(key);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Remove(key);
    }
}

class EdgeSet2 : IEdgeCollection
{
    private HashSet<Tuple<string, string>> _edges = new HashSet<Tuple<string, string>>();

    private static Tuple<string, string> MakeEdgeKey(string node1, string node2)
    {
        return StringComparer.Ordinal.Compare(node1, node2) < 0 
            ? new Tuple<string, string>(node1, node2) 
            : new Tuple<string, string>(node2, node1);
    }

    public bool AddEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Add(key);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Remove(key);
    }
}

class EdgeSet3 : IEdgeCollection
{
    private HashSet<Tuple<string, string>> _edges = new HashSet<Tuple<string, string>>();

    private static Tuple<string, string> MakeEdgeKey(string node1, string node2)
    {
        return new Tuple<string, string>(node1, node2);
    }

    public bool AddEdge(string node1, string node2)
    {
        var key1 = MakeEdgeKey(node1, node2);
        var key2 = MakeEdgeKey(node2, node1);
        return _edges.Add(key1) && _edges.Add(key2);
    }

    public bool ContainsEdge(string node1, string node2)
    {
        var key = MakeEdgeKey(node1, node2);
        return _edges.Contains(key);
    }

    public bool RemoveEdge(string node1, string node2)
    {
        var key1 = MakeEdgeKey(node1, node2);
        var key2 = MakeEdgeKey(node2, node1);
        return _edges.Remove(key1) && _edges.Remove(key2);
    }
}

class Program
{
    static void Test(string[] nodes, IEdgeCollection edges, int edgeCount)
    {
        // use edgeCount as seed to rng to ensure test reproducibility
        var rng = new Random(edgeCount);

        // store known edges in a separate data structure for validation
        var edgeList = new List<Tuple<string, string>>();

        Stopwatch stopwatch = new Stopwatch();

        // randomly generated edges
        stopwatch.Start();
        for (int i = 0; i < edgeCount; i++)
        {
            string node1 = nodes[rng.Next(nodes.Length)];
            string node2 = nodes[rng.Next(nodes.Length)];
            edges.AddEdge(node1, node2);
            edgeList.Add(new Tuple<string, string>(node1, node2));
        }
        var addElapsed = stopwatch.Elapsed;

        // non random lookups
        int nonRandomFound = 0;
        stopwatch.Start();
        foreach (var edge in edgeList)
        {
            if (edges.ContainsEdge(edge.Item1, edge.Item2))
                nonRandomFound++;
        }
        var nonRandomLookupElapsed = stopwatch.Elapsed;
        if (nonRandomFound != edgeList.Count)
        {
            Console.WriteLine("The edge collection {0} is not working right!", edges.GetType().FullName);
            return;
        }

        // random lookups
        int randomFound = 0;
        stopwatch.Start();
        for (int i = 0; i < edgeCount; i++)
        {
            string node1 = nodes[rng.Next(nodes.Length)];
            string node2 = nodes[rng.Next(nodes.Length)];
            if (edges.ContainsEdge(node1, node2))
                randomFound++;
        }
        var randomLookupElapsed = stopwatch.Elapsed;

        // remove all
        stopwatch.Start();
        foreach (var edge in edgeList)
        {
            edges.RemoveEdge(edge.Item1, edge.Item2);
        }
        var removeElapsed = stopwatch.Elapsed;

        Console.WriteLine("Test: {0} with {1} edges: {2}s addition, {3}s non-random lookup, {4}s random lookup, {5}s removal",
            edges.GetType().FullName,
            edgeCount,
            addElapsed.TotalSeconds,
            nonRandomLookupElapsed.TotalSeconds,
            randomLookupElapsed.TotalSeconds,
            removeElapsed.TotalSeconds);
    }


    static void Main(string[] args)
    {
        var rng = new Random();

        var nodes = new string[5000];
        for (int i = 0; i < nodes.Length; i++)
        {
            StringBuilder name = new StringBuilder();
            int length = rng.Next(7, 15);
            for (int j = 0; j < length; j++)
            {
                name.Append((char) rng.Next(32, 127));
            }
            nodes[i] = name.ToString();
        }

        IEdgeCollection edges1 = new EdgeSet1();
        IEdgeCollection edges2 = new EdgeSet2();
        IEdgeCollection edges3 = new EdgeSet3();

        Test(nodes, edges1, 2000000);
        Test(nodes, edges2, 2000000);
        Test(nodes, edges3, 2000000);

        Console.ReadLine();
    }
}

Efficient storage, lookup and manipulation of a large undirected graph

2 Answers2