
The problem: storing a dynamic adjacency list of a graph in a file while retaining O(1) algorithmic complexity of the operations.

I am trying to store a dynamic bidirectional graph in a file (or files). Both nodes and edges can be added and removed, and the operations must be O(1). My current design is:

File 1 - Nodes

Stores two integers per node (inserts are appends; removals use a free list; see the sketch below):

  • number of incoming edges
  • number of outgoing edges
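For concreteness, a File 1 record along these lines could be declared as follows (a sketch; the integer width and field names are illustrative assumptions, not fixed by the design):

```cpp
#include <cstdint>

// Sketch of one File 1 record: two integers per node.
// The 64-bit width and the field names are assumptions for illustration.
struct NodeRecord
{
    std::int64_t incomingCount;  // number of incoming edges
    std::int64_t outgoingCount;  // number of outgoing edges
};
```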

File 2 - Edges

Stores 4 integers per edge (inserts are appends; removals use a free list plus a swap with the node's last edge to update its new index; see the sketch below):

  • from node (index into File 1)
  • from index (e.g. third incoming edge)
  • to node (index into File 1)
  • to index (e.g. second outgoing edge)
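Correspondingly, a File 2 record could be sketched as below (again, the width and names are illustrative assumptions; the semantics follow the list above):

```cpp
#include <cstdint>

// Sketch of one File 2 record: 4 integers per edge.
struct EdgeRecord
{
    std::int64_t fromNode;   // index into File 1
    std::int64_t fromIndex;  // position of this edge among the from node's edges
    std::int64_t toNode;     // index into File 1
    std::int64_t toIndex;    // position of this edge among the to node's edges
};
```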

File 3 - Links

Serves as an open-addressed hash table of the locations of edges in File 2. Basically, when you read a node from File 1 you know there are x incoming edges and y outgoing edges. With that you can go to File 3 to get the position of each of these edges in File 2 (see the lookup sketch after the example below). The key is therefore:

  • index of node in File 1 (e.g. 0 for the first node, 1 for the second node)
  • 0 <= index of edge < number of outgoing/incoming edges

Example of File 3 keys if represented as a chained hash table (which is unfortunately not suitable for files, but would not require hashing...):

Keys (index of node in `File 1` + 0 <= index < number of edges from `File 1`; not actually stored)
1 | 0 1 2
2 | 0 1
3 |
4 | 0
5 | 0 1 2
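To make the lookup path concrete, here is a minimal sketch of enumerating one node's outgoing edges through Files 1 and 3. The two callbacks are placeholders standing in for the real file-access code, not part of the design itself:

```cpp
#include <cstdint>
#include <functional>

// Sketch only: outgoingCountOf stands in for reading a node from File 1,
// edgePositionOf for the File 3 hash-table lookup.
void visitOutgoingEdges(
    std::int64_t node,
    const std::function<std::int64_t(std::int64_t)>& outgoingCountOf,
    const std::function<std::int64_t(std::int64_t, std::int64_t)>& edgePositionOf)
{
    const std::int64_t count = outgoingCountOf(node);             // File 1
    for (std::int64_t i = 0; i < count; ++i) {
        // Key = (index of node in File 1, 0 <= i < number of outgoing edges)
        const std::int64_t posInFile2 = edgePositionOf(node, i);  // File 3
        // ... read the edge record at posInFile2 from File 2 ...
        (void)posInFile2;
    }
}
```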

I am using qHash and QPair to hash these at the moment; however, the number of collisions is very high, especially when I compare it to single-int hashing, which is very efficient with qHash. Since the values stored are indices into yet another file, probing is rather expensive, so I would like to cut the number of collisions down.
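Roughly, the current hashing boils down to something like this (simplified; `EdgeKey` and `edgeKeyHash` are just names used for this sketch):

```cpp
#include <QHash>
#include <QPair>

// The (node index, edge index) key hashed via Qt's qHash overload for QPair.
using EdgeKey = QPair<int, int>;

uint edgeKeyHash(const EdgeKey& key, uint seed = 0)
{
    return qHash(key, seed);
}
```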

Is there a specialized hashing algorithm or approach for a pair of ints that could perform better in this situation? Or, of course, a different approach that would avoid this problem altogether, such as how to implement a chained hash table in a file (I can only think of using buffers, but I believe that would be overkill for sparse graphs like mine)?

Resurrection
  • See http://stackoverflow.com/a/892640/56778, which gives a reasonably good way to hash an arbitrary number of integers. – Jim Mischel Jun 20 '16 at 16:12
  • @JimMischel Wow, that really is a great hashing algorithm for my situation! Finally something that outperforms qHash. :-) Also, by tweaking it a bit I managed to get even better results (I used 3 primes instead of 2). – Resurrection Jun 20 '16 at 17:17
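For reference, the general shape of such a prime-multiply combine is sketched below; the exact constants (and the three-prime variant mentioned above) are details of the linked answer and the tweak, the values here are just common examples:

```cpp
#include <cstdint>

// Sketch of a prime-multiply hash for a pair of integers, in the spirit
// of the linked answer. The constants 17 and 31 are illustrative only.
std::uint32_t pairHash(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t h = 17;
    h = h * 31 + a;
    h = h * 31 + b;
    return h;
}
```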

1 Answer


If you read through the comments on this answer, they claim qHash of an int just returns that int unchanged (which is a fairly common way to hash integers for undemanding use in in-memory hash tables). So, using a strong general-purpose hash function will achieve a dramatic reduction in collisions, though you may lose out on some incidental caching benefits of having nearby keys more likely to hash to the same area on disk, so do measure rather than taking it for granted that fewer collisions means better performance.

I also suggest trying boost::hash_combine to create an overall hash from multiple hash values (just using + or XOR is a very bad idea).

Then, if you're reading from disk, there's probably some kind of page size - e.g. 4k, 8k - which you'll have to read in to access any data anywhere on that page, so if there's a collision it'd still be better to look elsewhere on the already-loaded page rather than waiting to load another page from disk. Simple linear probing manages that a lot of the time, but you could improve on it further by wrapping back to the start of the page to ensure you've searched all of it before probing elsewhere.
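For illustration, a minimal sketch of both ideas follows: boost::hash_combine to build one hash from the two key parts, and a probe sequence that exhausts the key's home page before moving on (the slot layout and sizes are assumptions of the sketch, not prescriptions):

```cpp
#include <boost/functional/hash.hpp>
#include <cstddef>
#include <cstdint>

// Combine the two halves of the (node index, edge index) key.
std::size_t hashNodeEdgeKey(std::uint64_t nodeIndex, std::uint64_t edgeIndex)
{
    std::size_t seed = 0;
    boost::hash_combine(seed, nodeIndex);
    boost::hash_combine(seed, edgeIndex);
    return seed;
}

// i-th probe position for a key whose home slot is `home`: the first
// slotsPerPage probes cover the home page (wrapping within it), later
// probes continue linearly into the following pages.
std::size_t probeAt(std::size_t home, std::size_t i,
                    std::size_t slotsPerPage, std::size_t tableSize)
{
    const std::size_t pageStart = (home / slotsPerPage) * slotsPerPage;
    if (i < slotsPerPage)
        return pageStart + (home - pageStart + i) % slotsPerPage;
    return (pageStart + slotsPerPage + (i - slotsPerPage)) % tableSize;
}
```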

Tony Delroy
  • I have tested various hashing algorithms for pairs of integers, now including hash_combine, yet qHash of QPair still seems to outperform all the others. With factor 2.1 it seems to produce no collisions at all, which is very desirable (the others need a much higher factor). With a lower factor the collisions go up rather rapidly. That is also true for all of the half dozen or so algorithms I have tried. The point about page size is a good one though, and would definitely help mitigate the concern about too many disk reads. – Resurrection Jun 20 '16 at 06:49