
I am working on software that deals with sparse matrices. They're not huge (anywhere from 15x15 to ~300x300). I'd like to be able to store a representation of the matrix in a short string so that I can store it as a value in a CSV file (along with many other things).

What I've tried so far is to treat the flattened matrix as one long binary number and convert it to base 62:

import numpy as np
import networkx as nx

def graphToHash(a, numnodes):
    def baseN(num, b, numerals="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"):
        # recursive base conversion; digit alphabet is 0-9, a-z, A-Z (base 62)
        return ((num == 0) and numerals[0]) or (baseN(num // b, b, numerals).lstrip(numerals[0]) + numerals[num % b])
    # flatten the 0/1 matrix, read it as one big base-2 integer, re-encode in base 62
    return str(numnodes) + '!' + baseN(int(''.join([str(i) for i in flatten_list(a)]), 2), 62)

def flatten_list(l):
    # flatten one level: items from sub-lists/arrays first, then any remaining scalars
    l1 = [item for sublist in l if isinstance(sublist, (list, np.ndarray)) for item in sublist]
    l = l1 + [item for item in l if not isinstance(item, (list, np.ndarray))]
    return l

# example
import sys
sys.setrecursionlimit(10000)
a=np.array(nx.to_numpy_matrix(nx.connected_watts_strogatz_graph(160,8,.3,1000))).astype(int)
hash=graphToHash(a,160)
len(hash) # ~4300 characters

This works fine for small graphs (30 nodes is ~150 characters). However, larger graphs are a bit unwieldy (160 nodes is ~4,300 characters and requires me to increase the recursion limit).

Because the graph is binary and sparse, I know I can do better. Ideally I'd like to continue using strings that are {0-9,a-z,A-Z} because I know these will not pose any issues in my CSV file.

What is the most efficient way to compress a binary sparse matrix?

Jeff
  • It actually works fine, thanks. Still looking for a better method though. – Jeff Nov 08 '16 at 18:05
  • I could, but it wouldn't make the strings any shorter. I've read that raising the recursion limit is bad, but I'm not sure why, so that's not my primary concern. – Jeff Nov 08 '16 at 18:14
  • Could you just `cpickle.dumps` the object into a string and then `zlib.compress`, afterwards, then encode as base62? – Dr. V Nov 08 '16 at 18:19
  • if you're going to limit yourself to alphanumeric, you might as well go the way of readability and ease of encoding/decoding and literally just use decimal 0-9. If you actually want compression and reliability you should give up on those restrictions – Aaron Nov 08 '16 at 18:19
  • You need to provide more information on what you are trying to compress. What is it exactly? – Mark Adler Nov 08 '16 at 18:27
  • @Aaron using base 62 cuts the string length in half (compared to decimal), so it's better compression. And limiting myself to alphanumeric does not decrease reliability. – Jeff Nov 08 '16 at 18:30
  • @MarkAdler I'm trying to compress a binary sparse matrix. – Jeff Nov 08 '16 at 18:30
  • @Dr.V `zlib(cpickle(graph))` is not a large improvement over what I have now (~3700 characters), and converting to base62 would only make that worse. – Jeff Nov 08 '16 at 18:36
  • Can you post a tiny example of your graph/matrix objects? Or is it a numpy object documented somewhere? – Dr. V Nov 08 '16 at 18:41
  • @Jeff `it doesn't decrease reliability`..... `requires me to increase the recursion limit` I agree it does throw out some of the space savings, but imao the best way to encode a list of ints for storage is straight to utf8. If you really wanna save space, encode your whole csv.. – Aaron Nov 08 '16 at 18:42
  • for example: `''.join([unichr(x).encode('utf-8') for x in listofints])` – Aaron Nov 08 '16 at 18:45
  • @Dr.V I edited my post to provide full reproducible code, but you'll need the networkx package installed, sorry – Jeff Nov 08 '16 at 18:51
  • Yeah, I don't have it or install it now, but now I assume it is a list of coordinate tuples (indicating the ones in the matrix) as Leon processed, right? – Dr. V Nov 08 '16 at 18:54
  • @Dr.V `a` is an nxn link matrix, but I can easily convert to and from a list of tuples. Either input representation is fine. – Jeff Nov 08 '16 at 18:58
  • Ok, I got part of what I was looking for in another comment: 4-5% filled. The other questions are a) is the diagonal all zeros, i.e. is it a simple graph? b) is the matrix symmetric, i.e. if a graph, is it directed or undirected? Can you include the example 160x160, 4-5% filled matrix in the question? – Mark Adler Nov 08 '16 at 19:46
  • @MarkAdler I increased the density of the example graph so that it's ~5% filled. Very good point, my graphs are (for now) undirected (symmetric) with no self-loops (0 diagonal)... So I think I can use the upper triangle with Aaron's method and cut the size in half – Jeff Nov 08 '16 at 20:08

3 Answers


How about using the sparse6 format? It uses printable ASCII characters. http://users.cecs.anu.edu.au/~bdm/data/formats.txt

import networkx as nx
G = nx.connected_watts_strogatz_graph(160,4,.3,1000)
s = nx.generate_sparse6(G)
print(len(s))
print(s)

505
>>sparse6<<:~?A__O??K@?SA?[B__D_kE?{F@CH`KI@[J`_L`gM`{NACOaGQ`?QA[R_oIAcTAsUA{VBCWBKXBSHBOZbW[ac\BsSBo^cO_CKacOb_?bCcccgedGfd?hdSiD[jDcldsmdwo_GTE?peGqe[rEcsEktb_^E{vcGwfGyfg{d_{FkAFg}_ObFs~gKgFH@GS\DhAG\BglDGtEG{AG|GhSmHPJhXKhkuHlNiO?CPOIKMILQI\RIdSIlTItUI|VJDMJ@XjHYbwjJPZcxZ_WPc`\Js`FX^_`__`CKL`k\bKc\IXcKldKsbDpfk|WLLhdPeLPjl[nHHllhmf`fLpnl{lMLCMLqM[?Eprm`tmtuM|vNDwNLxNTyN\rN[mKH{i@|n{~LP~OCeDI?oUA_YBOeBOcMOiEosjOyGpAHghZPUIh@sNYKeIKhqMpiNcwzPyO`QOQMS??~QQR`iRql{QsXQqVpqVjhmREXRSyR\NNqZRc?@a[Rk@Jq]
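Note: newer networkx releases removed `generate_sparse6` (see the comments below). A minimal round-trip sketch, assuming a networkx ≥ 2.0 install where the top-level helpers `nx.to_sparse6_bytes` / `nx.from_sparse6_bytes` are available (exact header/newline handling may vary by version); this is my addition, not part of the answer:

import networkx as nx

G = nx.connected_watts_strogatz_graph(160, 4, .3, 1000)
s = nx.to_sparse6_bytes(G)            # bytes, e.g. b'>>sparse6<<:~?A__O...\n'
H = nx.from_sparse6_bytes(s.strip())  # strip the trailing newline before decoding
assert {frozenset(e) for e in G.edges()} == {frozenset(e) for e in H.edges()}
print(len(s))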
Aric
  • networkx has moved `nx.generate_sparse6` to `nx.readwrite.sparse6.to_sparse6_bytes`. i think this happened some time during the networkx 2.0 update – steviestickman May 16 '21 at 02:01
  • commit that removed generate_sparse6 https://github.com/networkx/networkx/commit/5c22e401bdf5fdff19147bd1328a934f8d3205b5 – steviestickman May 16 '21 at 02:08

After our long discussion in the comments, I remembered that this is a binary array... derp. Run-length encoding:

def brle(decoded): #binary run length encoding
    run = 0
    encoded = []
    for i in decoded:
        if i:
            encoded.append(run)
            run = 0
        else:
            run += 1
    return encoded

def brld(encoded): #binary run length decoding
    decoded = np.zeros(sum(encoded)+len(encoded)+1) # flat length = zeros in the runs + number of ones + one trailing zero (assumes the original ends with a single 0)
    pos = 0
    for run in encoded:
        pos += run
        decoded[pos] = 1
        pos += 1
    return decoded

without any alphanumeric encoding...

a=np.array(nx.to_numpy_matrix(nx.connected_watts_strogatz_graph(160,4,.3,1000))).astype(int)
b = flatten_list(a)
encoded = brle(b)

len(';'.join([str(x) for x in encoded])) # ==1706 chars

c = brld(encoded)
assert(all(b==c)) # passes

with utf-8 encoding:

s = ''.join(unichr(x).encode('utf-8') for x in encoded) #711 bytes in memory
assert(encoded == [ord(x) for x in s.decode('utf-8')]) # passes
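Since the question's comments note the matrix is symmetric with a zero diagonal, only the strict upper triangle needs to be encoded, which roughly halves the output. A sketch of that idea continuing from the snippet above (it reuses `a` and `brle`); the `np.triu_indices` usage and the explicit-length decode are my additions, not part of the answer:

# keep only entries strictly above the diagonal (k=1)
iu = np.triu_indices(a.shape[0], k=1)
upper = a[iu]                          # 1-D array of the upper-triangle entries
encoded_ut = brle(upper)
len(';'.join(str(x) for x in encoded_ut))  # roughly half of the full-matrix encoding

# decode with the known length len(upper) instead of guessing trailing zeros
restored_upper = np.zeros(len(upper), dtype=int)
pos = 0
for run in encoded_ut:
    pos += run
    restored_upper[pos] = 1
    pos += 1

restored = np.zeros_like(a)
restored[iu] = restored_upper
restored = restored + restored.T       # mirror back to the full symmetric matrix
assert (restored == a).all()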
Aaron
  • Great :) Not sure utf-8 encoding is reversible without a delimiter, but that's ok-- still a big improvement – Jeff Nov 08 '16 at 19:35
  • each utf-8 char is a single number.. you don't need a delimiter... I add the decoding to my example – Aaron Nov 08 '16 at 19:41
  • a single utf-8 char can have an ordinal in `range(2**21)` (up to 4 bytes). Provisions exist for 5 and 6 byte characters, but I'm not sure if python implements that (originally reserved for future compat) – Aaron Nov 08 '16 at 19:47
  • I used some similar techniques to get better compression on a code golf answer I wrote up a while ago.. http://codegolf.stackexchange.com/a/92814/42652 – Aaron Nov 08 '16 at 19:56
  • Ah, good point. I might stick with base62 anyway... UTF-8 doesn't seem to store nicely in CSV (i.e., problems when reading into R or Excel). Still a big improvement. – Jeff Nov 08 '16 at 20:03
  • encoding raw values in utf-8 is best for when you're writing both the program that does the encoding and the decoding.. Excel would almost definitely hate you for that – Aaron Nov 08 '16 at 20:10
  • Although I will decode the graphs sometimes (in Python), another more frequent purpose is to check whether two graphs are identical (comparing strings). Not surprised at Excel, but yeah R hates it too. – Jeff Nov 08 '16 at 20:22

Since the graph is expected to be sparse, I would encode its adjacency-list-based representation. Something like this (note that I reused your version of baseN(), but I would replace it with an iterative version):

#!/usr/bin/env python3

def baseN(num, b, numerals="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    return ((num == 0) and numerals[0]) or (baseN(num // b, b, numerals).lstrip(numerals[0]) + numerals[num % b])

def encode_graph(g):
    # the leading 'a' is needed to protect the leading zero (if any)
    s = 'a' + 'a'.join(['a'.join(map(str,x)) for x in g])
    n = int(s, 11)
    return baseN(n, 62)

print(encode_graph([(0,1), (1,5), (1,23), (5,23)])) # outputs 64wc3BssnTd
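For completeness, a sketch of the iterative `baseN()` the answer mentions, plus a possible `decode_graph()` inverse. Both are my additions, assuming node labels are decimal integers (so the base-11 separator digit `'a'` never appears inside a label):

def baseN_iterative(num, b, numerals="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    # iterative drop-in replacement for baseN(); avoids the recursion limit
    if num == 0:
        return numerals[0]
    digits = []
    while num:
        num, r = divmod(num, b)
        digits.append(numerals[r])
    return ''.join(reversed(digits))

def decode_graph(code, numerals="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    # base62 string -> integer
    n = 0
    for ch in code:
        n = n * 62 + numerals.index(ch)
    # integer -> base-11 digit string; 'a' (digit ten) is the edge/node separator
    s = baseN_iterative(n, 11, "0123456789a")
    parts = [int(p) for p in s.split('a') if p]  # the leading 'a' yields one empty piece
    return list(zip(parts[0::2], parts[1::2]))

print(decode_graph('64wc3BssnTd'))  # expected: [(0, 1), (1, 5), (1, 23), (5, 23)]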
Leon
  • Thanks, but unfortunately this is slightly worse than my existing solution. It may work well for graphs that are even more sparse though. The graph I tried is 160x160 and ~4-5% filled – Jeff Nov 08 '16 at 18:42