
I have a numpy array:

import numpy as np

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])

For each pair of rows, I would like to compute the number of positions in which they have the same element.

In this case I would get the 4x4 proximity matrix below (for example, rows [1, 2, 3] and [1, 2, 4] agree in their first two positions, so proximity[0, 1] = 2):

proximity = array([[3, 2, 0, 1],
                   [2, 3, 1, 1],
                   [0, 1, 3, 0],
                   [1, 1, 0, 3]])

This is the code that I am currently using:

n = len(leafs)
proximity = []

for i in range(n):
    print(i)
    proximity.append(np.apply_along_axis(lambda x: sum(x == leafs[i, :]), axis=1,
                                         arr=leafs))

I need a faster solution.

EDIT: The accepted solution does not work in this example:

>>> type(f.leafs)
<class 'numpy.ndarray'>
>>> f.leafs.shape
(7210, 1000)
>>> f.leafs.dtype
dtype('int64')

>>> f.leafs.reshape(7210, 1, 1000) == f.leafs.reshape(1, 7210, 1000)
False
>>> f.leafs
array([[ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       ..., 
       [139,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157]])
Donbeo
  • Itertools ([python2](https://docs.python.org/2/library/itertools.html) | [python3](https://docs.python.org/3/library/itertools.html)) should have something..? Try investigating this answer for [Generate all unique permutations of 2d array](http://stackoverflow.com/a/21960058/1762224) – Mr. Polywhirl Aug 05 '14 at 22:41
  • Can you be a bit more precise about what your output should be? Maybe I'm tired but for 'each couple of rows' I'm not seeing how a 4x3 maps to a 4x4 – sapi Aug 05 '14 at 22:42
  • The final matrix is square and has the same number of rows as the initial one. Each row of the matrix is compared with all the others. `similarity[i,j] = sum(leafs[i,k] == leafs[j,k] for k in range(len(leafs[i,:])))` – Donbeo Aug 05 '14 at 22:47
  • What are the elements on the array? Integers, strings, floats? – ojy Aug 05 '14 at 22:50
  • The elements are integers. – Donbeo Aug 05 '14 at 22:58
  • How big are the actual arrays? – Warren Weckesser Aug 05 '14 at 23:03
  • 4000 rows and 100 or 1000 columns – Donbeo Aug 05 '14 at 23:04
  • How many distinct integers do you expect to have in your array? Are they all quite different, or there are some categories? – ojy Aug 05 '14 at 23:08
  • Edit suggestion: Instead of "for each couple of rows", it would be clearer to say "for each pair of rows". – Warren Weckesser Aug 05 '14 at 23:12
  • Strange. If `l1` and `l2` are normal numpy arrays of integers, `l1 == l2` should return an *array*, not `False`. If the result is too big, it should raise a `MemoryError` exception. Could you also show `l1.dtype` and `l2.dtype`? – Warren Weckesser Aug 29 '14 at 12:31

2 Answers


Here's one way, using broadcasting. Be warned: the temporary array eq has shape (nrows, nrows, ncols), so if nrows is 4000 and ncols is 1000, eq will require 16GB of memory.
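As a rough sanity check on that figure (assuming the default bool dtype of one byte per element):

nrows, ncols = 4000, 1000
# the boolean temporary has nrows * nrows * ncols elements, one byte each
print(nrows * nrows * ncols / 1e9)  # 16.0 -> about 16 GB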

In [38]: leafs
Out[38]: 
array([[1, 2, 3],
       [1, 2, 4],
       [2, 3, 4],
       [4, 2, 1]])

In [39]: nrows, ncols = leafs.shape

In [40]: eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)

In [41]: proximity = eq.sum(axis=-1)

In [42]: proximity
Out[42]: 
array([[3, 2, 0, 1],
       [2, 3, 1, 1],
       [0, 1, 3, 0],
       [1, 1, 0, 3]])

Also note that this solution is somewhat inefficient: proximity is symmetric and the diagonal is always equal to ncols, but the code computes the full array, so it does more than twice as much work as necessary.
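A minimal sketch of one way to take advantage of that symmetry (just an illustration, not a tuned implementation; the name proximity_upper is mine): compare each row only against the rows that follow it, mirror the counts, and fill the diagonal with ncols.

import numpy as np

def proximity_upper(leafs):
    nrows, ncols = leafs.shape
    # the diagonal is always ncols, so start from a filled matrix
    proximity = np.full((nrows, nrows), ncols, dtype=int)
    for i in range(nrows):
        # matching positions between row i and each later row
        counts = (leafs[i] == leafs[i + 1:]).sum(axis=1)
        proximity[i, i + 1:] = counts
        proximity[i + 1:, i] = counts  # mirror into the lower triangle
    return proximity

This avoids the 3-d temporary entirely and does roughly half the comparisons, at the cost of a Python-level loop over the rows.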

Warren Weckesser
  • This is a fantastic solution and it is very very fast! Can you explain command 40 and 41 . Thanks! – Donbeo Aug 05 '14 at 23:14
  • @Donbeo, 4x1x3 against 1x4x3 => 4x4x3. [doc](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) – Fabricator Aug 06 '14 at 00:05
  • @Warren Weckesser There is a case in which your solution does not work. I have edited the question. This can probably be due to an excessive use of memory – Donbeo Aug 29 '14 at 11:50
  • As I noted in my answer, I expected memory would be an issue if the dimensions of `leafs` were large. Broadcasting+reduction is handy, but the excessive memory use of intermediate results can be an issue. @ojy's answer is a reasonable way to break up the calculation into small steps. – Warren Weckesser Aug 29 '14 at 12:23
  • I have edited the question. I am just curious to know why this is not working. ojy's answer is probably better in my situation – Donbeo Aug 29 '14 at 13:57

Warren Weckesser offered a very beautiful solution using broadcasting. However, even a straightforward approach using a loop can have comparable performance. np.apply_along_axis is slow in your initial solution because it does not take advantage of vectorization. The following fixes it:

def proximity_1(leafs):
    n = len(leafs)
    proximity = np.zeros((n,n))
    for i in range(n):
        proximity[i] = (leafs == leafs[i]).sum(1)  
    return proximity

You could also use a list comprehension to make the above code more concise. The difference is that np.apply_along_axis loops through all the rows in a non-optimized manner, while leafs == leafs[i] takes advantage of NumPy's speed.
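For instance, the list-comprehension version might look like this (a sketch equivalent to proximity_1; the function name is made up):

import numpy as np

def proximity_1_lc(leafs):
    # each (leafs == row) comparison is vectorized over the whole array
    return np.array([(leafs == row).sum(axis=1) for row in leafs])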

The solution from Warren Weckesser truly shows numpy's beauty. However, it includes the overhead of creating an intermediate 3-d array of size nrows*nrows*ncols. So if you have large data, the simple loop might be more efficient.

Here's an example. Below is the code offered by Warren Weckesser, wrapped in a function. (I don't know what the code copyright rules are here, so I assume this reference is enough :))

def proximity_2(leafs):
    nrows, ncols = leafs.shape    
    eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)
    proximity = eq.sum(axis=-1)  
    return proximity

Now let's evaluate the performance on an array of random integers of size 10000 x 100.

leafs = np.random.randint(1, 100, (10000, 100))

%time proximity_1(leafs)
>> 28.6 s
%time proximity_2(leafs)
>> 35.4 s

I ran both examples in an IPython environment on the same machine.

ojy