12

I am using an adjacency matrix to represent a network of friends which can be visually interpreted as

Mary     0        1      1      1

Joe      1        0      1      1

Bob      1        1      0      1

Susan    1        1      1      0 

         Mary     Joe    Bob    Susan

Using this matrix, I want to compile a list of all possible friendship triangles with the condition that user 1 is friends with user 2, and user 2 is friends with user 3. For my list, it is not required that user 1 is friends with user 3.

(joe, mary, bob)
(joe, mary, susan)
(bob, mary, susan)
(bob, joe, susan)

I have a bit of code that works well with small triangles, but I need it to scale for very large sparse matrices.

from numpy import *
from scipy import *

def buildTriangles(G):
    # G is a sparse adjacency matrix
    start = time.time()
    ctr = 0
    G = G + G.T          # I do this to make sure it is symmetric
    triples = []
    for i in arange(G.shape[0] - 1):  # for each row but the last one
        J,J = G[i,:].nonzero()        # J: primary friends of user i
                                      # I do J,J because I do not care about the row values
        J = J[ J < i ]                # only computer the lower triangle to avoid repetition
        for j in J:
            K, buff = G[:,j].nonzero() # K: secondary friends of user i
            K = K[ K > i ]             # only compute below i to avoid repetition
            for k in K:
                ctr = ctr + 1
                triples.append( (i,j,k) )
    print("total number of triples: %d" % ctr)
    print("run time is %.2f" % (time.time() - start())
    return triples

I was able to run the code on a csr_matrix in approximately 21 minutes. The matrix was 1032570 x 1032570 and contained 88910 stored elements. There were a total of 2178893 triplets generated.

I need to be able to do something similar with a 1968654 x 1968654 sparse matrix with 9428596 stored elements.

I'm very new to python (little less than a month of experience) and not the greatest at linear algebra, which is why my code does not take advantage of matrices operations. Can anyone make any suggestions for improvement or let me know if my objective is even realistic?

will
  • 225
  • 1
  • 7
  • I don't think assigning the same value twice in a statement (`J,J=`) has any guaranteed meaning in Python. I find it very confusing and so do you, judging by your comment, so you might want to get rid of it. – Fred Foo Aug 03 '11 at 20:26
  • @larsmans My apologies. nonzero() returns the indices of a matrix as a 2-d array. Alternatively I could have done `row, col = G[i,:].nonzero()` and then `J = col`. I used the `J,J=` approach because I was concerned about memory usage and wanted to eat up the row array, since it wasn't needed. – will Aug 03 '11 at 20:38
  • 1
    Don't apologize, I didn't mean to be harsh. It's just not a Pythonic idiom and I think Guido is at lib to change the meaning of that construct between Python versions, so you can't rely on it working. It's better to `del` a variable if it really matters, although in this case `J = G[i, :].nonzero()[1]` will work too. – Fred Foo Aug 03 '11 at 20:44
  • Thanks for the suggestions. It definitely cleaned up the code a bit. The work you were doing with the Wikipedia articles is exactly what I am trying to do. I will look more into a linear algebra approach to the problem. – will Aug 03 '11 at 21:22

2 Answers2

6

I think you can find triangles only in rows or columns. for example:

Susan    1        1      1      0 
        Mary     Joe    Bob    Susan

this means Mary, Joe, Bob are all friends of Susan, so, use combinations to choose two person from [Mary, Joe, Bob], and combine it with Susan will get one triangle. itertools.combinations() do this quickly.

Here is the code:

import itertools
import numpy as np

G = np.array(   # clear half of the matrix first
    [[0,0,0,0],
     [1,0,0,0],
     [1,1,0,0],
     [1,1,1,0]])
triples = []     
for i in xrange(G.shape[0]):
    row = G[i,:]
    J = np.nonzero(row)[0].tolist() # combinations() with list is faster than NumPy array.
    for t1,t2 in itertools.combinations(J, 2):
        triples.append((i,t1,t2))
print triples
HYRY
  • 94,853
  • 25
  • 187
  • 187
  • Thanks for your answer. I haven't even considered that approach, but it makes a lot of sense. You basically reduce the problem to finding permutations of two. Would all the triples be unique? – will Aug 03 '11 at 23:40
  • @will: To clarify, do you mean that (Mary, Susan, Joe) and (Joe, Susan, Mary) are counted as distinct or identical? – Iterator Aug 04 '11 at 13:27
  • @Iterator I mean to count them as identical. I believe that this method does work in this regard. After looking at it further, I now realize that each new row is guaranteed to not have been in the earlier permutations. – will Aug 04 '11 at 16:32
  • +1 to user772649. This is great. I want to find this function in other languages I work in. I've always had to write it myself. – Iterator Aug 04 '11 at 17:00
4

Here's some suggestions for optimization:

K = K[ K > i ]             # only compute below i to avoid repetition
for k in K:
    ctr = ctr + 1
    triples.append( (i,j,k) )

Don't increment in a loop, it's terribly slow. Just ctr += K.shape[0] will do. Then, eliminate the most deeply nested loop altogether by replacing the append with

triples += ((i, j, k) for k in K[K > i])

Now, if you want real performance on this task, you will have to get into some linear algebra. "I want to compile a list of all possible friendship triangles" means that you want to square the adjacency matrix, which you can do with a simple **2.

Then realize that 1.968.654² means a very big matrix, and even though it's very sparse, its square will be much less so and will take a lot of memory. (I once tackled a similar problem where I considered links between Wikipedia articles at distance two, which took 20 minutes to solve, on a supercomputer cluster node, in C++. This is not a trivial problem. The Wikipedia adjacency matrix was a few orders of magnitude denser, though.)

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • When you mention "real performance" - can you elaborate on how you multiply two matrices and get a list (rather than a count) of 2-step pairings? – Iterator Aug 03 '11 at 21:13
  • @Iterator: Multiplying a square matrix with itself gives you a new matrix of the same rank that has value >0 for all *i*, *j* that are connected at step distance 2. Matrix multiplication is a heavily optimized operation in SciPy (implemented in C, I think, or maybe even Fortran). You can then extract the list yourself with much less searching in the matrix. – Fred Foo Aug 04 '11 at 09:56
  • Yes, you get the step 2 counts, which is what I said: you can get a count of the (i,*,k) pairs. The identities of the intermediate j nodes are lost. I understand (and stated) everything you said, but you haven't demonstrated any speedup of the naming of the full triplet. I think you're not thinking this through all the way. – Iterator Aug 04 '11 at 12:25