
I have a 3.7 GB edgelist file describing a complete graph on 20k nodes, where each edge has a float 'weight' (all 1.0) and an int 'length' (each in the range 0–1000).

So the head of the edgelist file looks like this:

0 1 1.0 76
0 2 1.0 85
0 3 1.0 118
0 4 1.0 94
0 5 1.0 71
...

I'm loading it using:

from networkx import Graph, read_edgelist

def load_graph(file_path: str) -> Graph:
    return read_edgelist(file_path, nodetype=int,
                         data=[('weight', float),
                               ('length', int)])

But while networkx.read_edgelist is running, my computer grinds to a halt with nearly 100 GB of memory usage.

What gives? Is this peculiar to read_edgelist, or does a networkx.Graph just use a vast amount of memory? Either way, can anyone recommend an alternative graph library which operates with a smaller footprint?

Cai
  • Is there a reason you need it to be a graph? `D[(0,1)] = (1.0,76)` would be a much more efficient way to store the edge. Since it's a complete graph, it's not clear to me that there is much advantage in actually storing it as a graph rather than just storing it as a dict with entries for each edge. – Joel Jun 28 '18 at 17:21
  • The thing the graph is buying me is the fact that the edges are undirected. So (0,1) is the same as (1,0). Is there a way to use tuples (or something similar) as dictionary keys but without caring about the order of their entries? This would allow me to drop networkx. – Cai Jun 29 '18 at 08:36
  • Ok, an answer to this question https://stackoverflow.com/q/41259493/2883198 says to use `frozenset`s as keys, so I'll try that. – Cai Jun 29 '18 at 09:13
  • Another approach would be to have a getter function that always checks the tuple key in a fixed order of indices (`frozenset` has some overhead, and with this data size it is worth optimizing). Also, for processing it would make sense to use more memory-compact storage, e.g. a NumPy array. – sophros Jul 02 '18 at 13:11
  • @sophros Thanks for your comment, but I didn't quite follow — could you be a little more explicit? What is the overhead with frozenset and how could I get around that? Would a numpy array be more efficient than a frozenset-keyed dict of namedtuples (which is what I'm now using)? I should be clear that the graph is "almost" complete, but I can't rely on that fact when e.g. getting lists of incident edges. Right now I use a comprehension `n` -> `set(e for e in self.edges if n in e)` where `self.edges` is the dict I mentioned. I'd ask a new SO question, but I'm not sure exactly what to ask! – Cai Jul 03 '18 at 08:43
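
A minimal sketch of the dict-based storage discussed in these comments, assuming the file layout from the sample head (all names here are illustrative, not from the original code):

def edge_key(u: int, v: int) -> tuple:
    # Normalise the pair so (0, 1) and (1, 0) map to the same key.
    return (u, v) if u <= v else (v, u)

def load_edges(file_path: str) -> dict:
    edges = {}
    with open(file_path) as f:
        for line in f:
            u, v, weight, length = line.split()
            edges[edge_key(int(u), int(v))] = (float(weight), int(length))
    return edges

# edges = load_edges("edgelist.txt")
# edges[edge_key(1, 0)] is then the same entry as edges[edge_key(0, 1)]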

1 Answer


Since the discussion has diverged from the performance of networkx to the optimal way to store an "almost" complete graph, I will concentrate for the moment on summarizing the rationale for using tuples rather than frozensets as dictionary keys.

I was trying to find hard confirmation of this, but given that it carries slightly more methods, a frozenset can take a bit more memory than a tuple. From this question I learned that the hashing algorithm for frozensets has been reimplemented, which helps the performance of dictionary inserts and lookups (both of which hash the key along the way); on the other hand, Python is heavily optimized for tuples, lists, and strings of various lengths, which makes me suspect that 2-tuples are still faster than frozensets, if only for this reason.
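
A quick way to check the difference yourself (exact numbers vary by Python version and platform, so treat this as a sketch rather than a benchmark):

import sys
from timeit import timeit

# Per-key memory: a 2-tuple is a small fixed-size object, while a
# frozenset carries a hash table, so it is several times larger per key.
print(sys.getsizeof((0, 1)))             # e.g. 56 bytes on 64-bit CPython 3
print(sys.getsizeof(frozenset((0, 1))))  # noticeably more

# Lookup cost, including building the key each time, which is what an
# accessor function would do on every call:
d = {(0, 1): (1.0, 76), frozenset((0, 1)): (1.0, 76)}
print(timeit(lambda: d[(0, 1)]))
print(timeit(lambda: d[frozenset((0, 1))]))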

Now, when we consider NumPy arrays, the reasons they may be better for the task are manifold:

  1. The memory is contiguous, which helps dramatically with cache locality (important when traversing the whole array).
  2. NumPy is more efficient than ordinary Python lists at a larger scale of data (say, tens of thousands of values).
  3. Individual values are stored more efficiently (see below for an explanation).

In your case you seem to need to store two values per edge: one float and one int. You could allocate two 2-dimensional ndarrays, one of float32 type and one of int type. You can then either fill in only one triangle of each array and write a special accessor method (one that checks both orders of indices; this may be slower), or fill in both symmetric entries (say, [1, 2] and [2, 1]).

I am assuming that you do not require both values at all times, so decoupling the int and float32 values should actually benefit the performance of algorithms that use only one of them. The memory consumed by the ndarrays should be smaller, and sequential processing of the indices much faster than with a dictionary, which jumps around memory quite randomly.
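
A minimal sketch of this layout with symmetric fill-in, assuming the 20k nodes and value ranges from the question (all names are illustrative):

import numpy as np

n = 20_000  # number of nodes, from the question

# One array per attribute; lengths of 0-1000 fit comfortably in int16,
# so the full n x n array costs about 0.8 GB, and the float32 one about
# 1.6 GB. Halve these by storing only one triangle plus an accessor.
weights = np.zeros((n, n), dtype=np.float32)
lengths = np.zeros((n, n), dtype=np.int16)

def set_edge(u: int, v: int, weight: float, length: int) -> None:
    # Fill both symmetric entries so lookup order never matters.
    weights[u, v] = weights[v, u] = weight
    lengths[u, v] = lengths[v, u] = length

set_edge(0, 1, 1.0, 76)
assert lengths[1, 0] == 76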

sophros
  • @Cai - did it help you? If so, could you please accept the answer? – sophros Aug 07 '18 at 09:35
  • Ok, I have been working on this for a long time. In the end, for my purposes it seems like an edge class which inherits from `tuple` works the best. I think using numpy arrays will be less useful for testing node membership in an edge (unless I do something very clever), but I think you're right that `tuples` are the right thing, not `frozenset`s. `frozenset`s just use too much memory compared to tuples. – Cai Aug 28 '18 at 15:10
  • In case it helps anyone else, I made `class Edge(tuple): def __new__(cls, seq=()): return tuple.__new__(cls, sorted(seq))`, so `Edge((0, 1)) == Edge((1, 0))` is `True`. I also added an `assert len(seq) == 2` into `__new__`, but that may be unnecessary. – Cai Aug 28 '18 at 15:13
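
For completeness, a runnable expansion of that one-liner, plugged into a plain edge dict (values taken from the sample head; assertions just demonstrate the behaviour):

class Edge(tuple):
    def __new__(cls, seq=()):
        assert len(seq) == 2  # an edge joins exactly two nodes
        return tuple.__new__(cls, sorted(seq))

edges = {Edge((0, 1)): (1.0, 76)}        # (weight, length)
assert Edge((0, 1)) == Edge((1, 0))      # order-insensitive equality
assert edges[Edge((1, 0))] == (1.0, 76)  # reversed lookup finds the same entry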