Given the discussion diverged from the one on performance of networkx
to the optimal way to store "almost" complete graph I will concentrate on summarizing the rationale behind rather using tuples than frozenset
type for the dictionary key for the moment.
I was trying to find a confirmation for that but given slightly more methods frozenset
can take a bit more memory than tuple. From this question I learned that hashing algorithm has been reimplemented which helps for performance of dictionary inserts and lookups (which takes hash of the key on the way) but on the other hand Python is heavily optimized for tuples, lists, and strings of various lengths which makes me wonder if 2-tuples are still not faster than frozenset
if only for this reason.
Now, when we consider NumPy
arrays - the reason they may be better for the task is manifold:
- The memory is contiguous which helps dramatically with cache
locality (important thing when doing a traverse of the whole
array).
NumPy
is more optimal than ordinary lists for a larger scale of data (say tens of thousands of values).
- Individual values are stored more efficiently (see below for explanation).
In your case you seem to need to store 2 values - one float
, one int
. You could allocate 2 2-dim ndarray
s - one of int
and one of float32
type. You can fill in the array either diagonally and create a special accessor method (that would check both orders of indices - this may be slower) or fill in both indices (say: 1,2 and 2,1).
I am assuming that you do not require both values at all times therefore decoupling int
and float32
values would be actually beneficial for the performance of algorithms using respective values. The memory consumed by ndarray
s should be smaller and consecutive processing of the indices much faster than in case of a dictionary which quite randomly jumps around the memory.