Any number of link from a single node to nodes in a linked list?

Question

In C I can write singly, doubly or whatever 4, 5 links from a node. But can I make a structure like where, from a node the number of links is not specified, i.e. it has to add a new link dynamically. if it's not possible in C then where can I? For example: node "a" has links to node "b","c","d","e" and "f" and "b" has links to only "c","d","a" .. like that. So the number of links is not specified. Hope you understand my problem.

Nominal Animal · Answer 1 · 2017-04-19T02:06:51.477

Yes, by using a dynamically allocated array for the links. Typically, you'll want to store both the actual number of links (count), as well as the number of link pointers dynamically allocated (maxcount):

struct node {
    size_t        maxcount;
    size_t        count;
    struct node **child;
    /* Plus node data or other properties */
};

If each link (or edge) has an associated weight, you can use

struct node;

struct edge {
    struct node *target;
    double       weight;
};

struct node {
    size_t       maxcount;
    size_t       count;
    struct edge *edges;

    /* Visited flag, for full graph traversal */
    int          visited;

    /* Node data; here, name, as a C99 flexible array member */
    char         name[];
};

Note that Graphviz is an excellent tool for visualizing such graphs. I regularly write my tests to produce DOT language graph definitions on output, so that I can use e.g. dot from Graphviz to draw them as nice graphs.

If you have a directed graph using struct edge and struct node above, with all visited fields cleared to zero, you can safely -- that is, without getting stuck at cycles -- create the DOT output for such a graph using a recursive helper function, say

static void write_dot_node(struct node *graph, FILE *out)
{
    size_t  i;

    if (graph->visited)
        return;

    graph->visited = 1;

    fprintf(out, "\tnode%p [ label=\"%s\" ];\n", graph, graph->name);

    for (i = 0; i < graph->count; i++) {
        write_dot_node(graph->edges[i].target, out);
        fprintf(out, "\tnode%p -> node%p [ taillabel=\"%.3f\" ];\n",
                     graph, graph->edges[i].target,
                     graph->edges[i].weight);
    }
}

void write_dot(struct node *graph, FILE *out)
{
    if (!graph || !out)
        return;

    fprintf(out, "digraph {\n");
    write_dot_node(graph, out);
    fprintf(out, "}\n");
}

If you have truly huge graphs, the above may recurse too deep in some cases. It then needs to be converted to non-recursive loop that uses an explicit stack of nodes yet to be visited, with the write_dot function initializing and discarding the stack:

#define  VISITED_STACKED (1<<0)
#define  VISITED_VISITED (1<<1)

int write_dot(struct node *graph, FILE *out)
{
    struct node **unseen = NULL, **temp;
    size_t        unseens = 1;
    size_t        unseens_max = 1024; /* Initial stack size */

    unseen = malloc(unseens_max * sizeof *unseen);
    if (!unseen) {
        errno = ENOMEM;
        return -1;
    }
    unseen[0] = graph;

    fprintf(out, "digraph {\n");

    while (unseens > 0) {
        struct node *curr = unseen[--unseens];
        size_t       i, n;

        /* Already visited (since pushed on stack)? */
        if (curr->visited & VISITED_VISITED)
            continue;
        else
            curr->visited |= VISITED_VISITED;

        fprintf(out, "\tnode%p [ label=\"%s\" ];\n", curr, curr->name);

        for (i = 0, n = 0; i < curr->count; i++) {
            /* Count unvisited child nodes */
            n += !(curr->edges[i].target->visited & VISITED_STACKED);

            fprintf(out, "\tnode%p -> node%p [ taillabel=\"%.3f\" ];\n",
                    curr, curr->edges[i].target, curr->edges[i].weight);
        }

        if (n + unseens > unseens_max) {
            if (n + unseens > 1048576)
                unseens_max = ((n + unseens) | 1048575) + 1048573;
            else
            if (n + unseens < 2 * unseens_max)
                unseens_max = 2 * unseens_max;
            else
                unseens_max = 2 * (n + unseens);

            temp = realloc(unseen, unseens_max * sizeof *unseen);
            if (!temp) {
                free(unseen);
                errno = ENOMEM;
                return -1;
            } else
                unseen = temp;
        }

        /* Add unvisited child nodes to stack. */
        for (i = 0; i < curr->count; i++)
            if (!(curr->edges[i].target->visited & VISITED_STACKED)) {
                curr->edges[i].target->visited |= VISITED_STACKED;
                unseen[unseens++] = curr->edges[i].target;
            }
    }

    free(unseen);

    fprintf(out, "}\n");

    return 0;
}

In this case, the VISITED_STACKED bit mask indicates the node has already been added to the stack for later processing, and VISITED_VISITED bit mask indicates the node has been processed.

As Ovanes pointed out in a comment to this answer, for a very dense graph, you could use a c++ map or hashtable, especially if you often need to find out whether some pair of nodes share an edge or not. In this case, you can augment the above structure with an optional hash table of the target pointers.

Just for fun, I tested this in practice for directed weighted graphs with per-graph user-specified reallocation size functions and hashing functions, using the interface digraph.h:

#ifndef   DIGRAPH_H
#define   DIGRAPH_H
#include <stdlib.h>
#include <stdio.h>

struct digraph_node;

struct digraph_edge {
    struct digraph_edge  *hash_next;        /* Hash table slot chain */
    size_t                hash_code;        /* Hash value */
    struct digraph_node  *target;           /* Target edge of the node */
    double                weight;           /* Weight of this edge */
};

struct digraph_node {
    struct digraph_node  *hash_next;        /* Hash table slot chain */
    size_t                hash_code;        /* Hash value */
    struct digraph_edge  *edge;             /* Array of edges */
    size_t                edges;            /* Number of edges in this node */
    size_t                edges_max;        /* Number of edges allocated for */
    struct digraph_edge **hash_slot;        /* Optional edge hash table */
    size_t                hash_size;        /* Size of optional edge hash table */
    char                  name[];           /* Name of this node */
};

typedef struct {
    struct digraph_node **node;             /* Array of pointers to graph nodes */
    size_t                nodes;            /* Number of nodes in this graph */
    size_t                nodes_max;        /* Number of nodes allocated for */
    struct digraph_node **hash_slot;        /* Optional node hash table */
    size_t                hash_size;        /* Size of optional node hash table */
    /* Graph resize policy and hash function */
    size_t (*graph_nodes_max)(size_t nodes);
    size_t (*graph_hash_size)(size_t nodes);
    size_t (*graph_hash_func)(const char *name);
    /* Node resize policy and hash function */
    size_t (*node_edges_max)(size_t edges);
    size_t (*node_hash_size)(size_t edges);
    size_t (*node_hash_func)(struct digraph_node *target);
} digraph;

void digraph_init(digraph *graph);
void digraph_free(digraph *graph);

struct digraph_node *digraph_find_node(digraph *graph, const char *name);
struct digraph_edge *digraph_find_edge(digraph *graph, struct digraph_node *source, struct digraph_node *target);

struct digraph_node *digraph_add_node(digraph *graph, const char *name);
struct digraph_edge *digraph_add_edge(digraph *graph, struct digraph_node *source, struct digraph_node *target, double weight);

int digraph_dot(digraph *graph, FILE *out);

size_t digraph_default_graph_nodes_max(size_t nodes);
size_t digraph_default_graph_hash_size(size_t nodes);
size_t digraph_default_graph_hash_func(const char *name);
size_t digraph_default_node_edges_max(size_t edges);
size_t digraph_default_node_hash_size(size_t edges);
size_t digraph_default_node_hash_func(struct digraph_node *target);

#define  DIGRAPH_INIT  { NULL, 0, 0,                        \
                         NULL, 0,                           \
                         digraph_default_graph_nodes_max,   \
                         digraph_default_graph_hash_size,   \
                         digraph_default_graph_hash_func,   \
                         digraph_default_node_edges_max,    \
                         digraph_default_node_hash_size,    \
                         digraph_default_node_hash_func     \
                       }

#endif /* DIGRAPH_H */

For simplicity, the digraph_add_node() and digraph_add_edge() functions always set errno; to 0 if success, to EEXIST if such a node or edge already exists, or to an error code otherwise. If the node or edge already exists, the functions do return the existing one (but with errno set to EEXIST instead of 0). This makes adding new edges very easy.

On a laptop running 64-bit linux on an Intel Core i5-6200U processor, it took about 18 seconds to generate 5,000 random nodes, 12,500,000 random edges (2,500 nodes per edge), and describe the entire graph using GraphViz dot language. This suffices speed-wise for me, as I don't have any tools to visualize such graphs; even Graphviz completely chokes on these.

Note that because the graph structure contains an array of pointers to each node in the graph, no recursion is needed using the above structures:

int digraph_dot(digraph *graph, FILE *out)
{
    size_t  i, k;

    if (!graph)
        return errno = EINVAL;

    if (!out || ferror(out))
        return errno = EIO;

    fprintf(out, "digraph {\n");

    /* Describe all nodes. */
    for (i = 0; i < graph->nodes; i++)
        if (graph->node[i])
            fprintf(out, "\tnode%p [label=\"%s\"];\n",
                         (void *)graph->node[i], graph->node[i]->name);

    /* Describe all edges from all nodes. */
    for (i = 0; i < graph->nodes; i++)
        if (graph->node[i]) {
            if (graph->node[i]->edges) {
                for (k = 0; k < graph->node[i]->edges; k++)
                    fprintf(out, "\tnode%p -> node%p [taillabel=\"%.3f\"];\n",
                                 (void *)(graph->node[i]),
                                 (void *)(graph->node[i]->edge[k].target),
                                 graph->node[i]->edge[k].weight);
            } else {
                fprintf(out, "\tnode%p;\n", (void *)(graph->node[i]));
            }
        }

    fprintf(out, "}\n");

    if (fflush(out))
        return errno;
    if (ferror(out))
        return errno = EIO;

    return errno = 0;
}

Of course, if you have that dense graphs, with each node on average having an edge to half the other nodes in the graph, a weight matrix (with zero weight reserved for no edge, or a separate boolean matrix describing which edges do exist) would make a lot more sense, and would have much better cache locality, too.

When the matrix is stored in row-major order (for example, if you define a two-dimensional array in C), it makes sense to have each row correspond to one source node, and each column correspond to one target node, so that the vector describing the edges and/or weights from one node is consecutive in memory. It is much more common to examine the edges directed outwards from a specific source node, than examining the edges directed to a specific target node; thus, the more common case having better cache locality should help with overall performance.

In your first `struct`, should `size_t *child;` be `struct node **child;`? Also, the identifier `child` is only appropriate if the graph is a tree or a DAG. — Ian Abbott, Apr 18 '17 at 14:35
@ovaries: I think it is supposed to be for a dynamically allocated array of `struct node *`, which is why I think it should be declared as `struct node **child;`. — Ian Abbott, Apr 18 '17 at 17:13
@IanAbbott: Yes, it definitely should be `struct node **child;` -- a dynamically allocated array of child pointers. Thank you for the fix. — Nominal Animal, Apr 18 '17 at 19:10
@ovanes: While an array of nodes (`struct node *child`) would work if the node type does not have a C99 flexible array member, resizing the array using `realloc()` might mean their addresses would change, and that might mean a fixup pass across the graph to update the pointers. Resizing an array of pointers to child nodes (`struct node **child;`) only changes the address of the array, which means no fixup pass is ever needed (unless you otherwise change the address of an existing node). Plus, you can use a C99 flexible array member in `struct node;`, too. — Nominal Animal, Apr 18 '17 at 19:16
@IanAbbott: I understand, but actually, why do you need an array at all? One can just allocate a node on demand and append it to the linked list of edges and increment with every new pointing edge a reference counter in the vertex => edges as linked list. The problem with an array is that as soon you need to grow it, you must reallocate the entire memory, since the array guarantees contiguous memory. To speed up the append, one can prepend the new element at the head of edges. Anyway to find out if one node points to the other will be expensive => Thus I suggested either `set` or `hashtable`. — ovanes, Apr 18 '17 at 19:16
@NominalAnimal: We posted comments at the same time, but I thing that my previous comment also applies to your answer. In fact, in a dense graph you might end up with O(|V|) lookup times to check if given node `A` is connected to `B`. Thus usually often either `set` or `hashtable` are used to speed up this lookup. — ovanes, Apr 18 '17 at 19:27
@ovanes: Each separate allocation you do has some overhead, roughly a pointer or so, but possibly more, depending on the allocator implementation in the C library -- so 50% to 100% here. A linked list of child node pointers is also horrible for cache locality: slow. Thus, inefficient memory use and horribly slow child node scanning in general. Reallocation is more efficient, if you allocate in suitably increasing chunks instead (say for 8 at first, then double till say 65536 pointers, then in sets of 65536); that's why I have the `maxcount` separate from `count`. — Nominal Animal, Apr 18 '17 at 19:27
@ovanes: The lookup is exactly linear wrt. the number of edges in the both nodes, yes. It is not a performance issue until you get several dozen of child nodes, because a linear table lookup is so fast. However, you could trivially add a `struct hashtable_of_child_pointers *` member, that is (re)generated if the node has a sufficient number of child pointers to warrant the overhead. I've yet to need one, however: the lookup is usually better *"folded"* into an operation-specific hash table, or replaced by a scratch `visited` or `color` field in each node (for graph scanning). — Nominal Animal, Apr 18 '17 at 19:37
@NominalAnimal: Yes, you can do that. There is a way to end up with amortized constant time allocation, by e.g. doubling the size of the array, with every new allocation. Afterwards you suggest to use smth. like a deque in C++, to allocate memory in pages. I think one needs to measure it to be completely sure. Cache locality is more complex as you point it out IMO. If graph is big it will not fit into cache at all. And if graph is allocated once without modification, than the probability is high that nodes land in the same memory page and partially fit into cache than... — ovanes, Apr 18 '17 at 19:45
@ovanes: Just for fun, I added a find-edge and add-edge implementations that can use a hash table. (Can, because they do not need to, for small number of edges.) The hash table entries are chained (using a `struct edge *next;` member in the edge structure), so adding the hash table costs only the memory needed for the entry pointers. This could be made even better with a `struct graph` type, with pointers to policy functions (edge array resize, hash function, hash table size). It is easier in [tag:C++], but not that hard or odd in [tag:C] either. — Nominal Animal, Apr 18 '17 at 20:58
@NominalAnimal: +1, looks great! Actually, if you take a look a Boost Graph lib, they also have property_map. Using a hashtable(s) offers a non-intrusive approach in dealing with the graph. The `visited` flag should not be a part of graph :) This limits the scalability or maintainability. Let's say 2 different parts of the program want to inspect the graph (even not multithreaded) but one after the other. They will have a problem, or one needs to reset the `visited` flag it previously set in every vertex after the graph traversal, same applies to the other application part. — ovanes, Apr 18 '17 at 21:12
@ovanes: I changed the example a bit, omitting the `visited` flag, as adding an array of pointers to all nodes in a graph to the graph structure eliminates the need for recursion or a stack when describing the graph. I could post the implementation (digraph.c corresponding to the posted header digraph.h) and example random graph program I wrote in C99, with the Makefile -- but I'm not sure if it makes sense in the context of this question; I do not want to post code that helps people avoid studying or doing their homework. — Nominal Animal, Apr 19 '17 at 02:13

score 0 · Answer 2 · edited May 23 '17 at 12:34

If we are talking about graphs you should decide which graph you are trying to implement:

I assume from your question, that you need to implement the directed graph. Where a vertex (=node) Acan have a directed edge (=link) to vertex B, but not necessarily vice versa. In general you are looking to implement a Multigraph. Here you have two alternatives to represent such a graph:

Matrix, i.e. 2D array
Hashtable, i.e. Key represents the source vertex which is mapped to a list or set of destination vertices. Every such mapping (i.e. a pair of vertices) represents a directed edge.

Let's just have an example on how each of the approaches work. I will give you the major direction, so that you can train and implement it yourself. Otherwise, I might end up doing your homework :)

Given the following Graph G:

A -> B -> C    
^--D-^

E

Note here! E has no incoming or outgoing edges. This is still a valid graph.

Using a Matrix we can represent it as:

    A B C D E
   +-+-+-+-+-+
 A |0|1|0|0|0|
   +-+-+-+-+-+
 B |0|0|1|0|0|
   +-+-+-+-+-+
 C |0|0|0|0|0|
   +-+-+-+-+-+
 D |1|1|0|0|0|
   +-+-+-+-+-+
 E |0|0|0|0|0|
   +-+-+-+-+-+

Everywhere we have a number 0 we don't have an edge from vertex X to vertex Y. The number greater than one means in your case the number of edges from source vertex to the destination vertex.

An example: Does vertex C point to the vertex B (e.g. has a directed edge)?

Answer: Find the row C move to the column B if there is a number >0 than, yes. Otherwise, no. In our case the number is 0 => no connection.

If you need to represent a weighted connection you will probably end up with an array/list in each cell of the matrix. If array is empty (hint: one of the ways to implement 0 sized array) => there is no connection between to vertices, otherwise each element of the array represents the weight of the connection. Number of elements represent number of connections.

Also, be aware that in graph a vertex might have an edge to itself. Don't know whether this applies in your case.

Keep in mind, that Matrix is a very memory expensive representation. It allows you a very fast lookup, but you always require O(|V|^2) (|V| stands for the total number of vertices in a graph) elements to represent a graph with the matrix.

2D arrays are a well supported feature in C. Here a small example, on how to statically initialise a 5x5 matrix from the above case in C. For dynamic initialisation dig into C language a little bit (take a look at malloc, free):

#define ROWS 5
#define COLUMNS 5


int matrix[ROWS][COLUMNS] =
//           A  B  C  D  E
/* A */  { { 0, 1, 0, 0, 0 }
/* B */  , { 0, 0, 1, 0, 0 }
/* C */  , { 0, 0, 0, 0, 0 }
/* D */  , { 1, 1, 0, 0, 0 }
/* E */  , { 0, 0, 0, 0, 0 }
         }
;

Using the second approach (hashtable or dictionary) is a bit different.

You need a hashtable. AFAIK, C and C99 do not have support for hashtables. Thus you will need either to use one of existing open source implementations or implement a hashtable yourself. Your question does not question how to do the implementation in C++ where either a map or unordered_map (>= C++11) are part of the standard library. But still you ask what can be used instead of C. Thus the closes language at that point would be C++.

Here is an idea on how to do that: Let every vertex be a key in the Hashtable and it points to either a set of vertices where the connection exists or to the second Hashtable which builds up the weight of particular mapping.

   A -> (B,)
   B -> (C,)
   C -> ()
   D -> (A, B)
   E -> ()

in case with multiple connections you need to introduce something like:

   A -> {B -> 1,}
   B -> {C -> 1,}
   C -> {}
   D -> {A -> 1, B -> 1}
   E -> {}

Let's just state that A has 2 edges to B, than the structure is instantiated like:

   A -> {B -> 2,}
   B -> {C -> 1,}
   C -> {}
   D -> {A -> 1, B -> 1}
   E -> {}

Here if the graph is big and sparse you can save really a lot of memory.

The easiest implementation IMO would be in Python, as you ask what can be used instead of C. But you can really use C, C++, Java, Python, Ruby or any other language, to implement that kind of problem.

Any number of link from a single node to nodes in a linked list?

2 Answers2