
I have a symmetric function get_corr which consumes two strings and returns a double.

def get_corr(id1, id2):
    # some magic to find double x
    #...
    return x

I also have a list of strings factors, with which I want to generate a symmetric matrix of size len(factors) x len(factors) by applying get_corr to the Cartesian product of factors with itself.

This would be very easy by just using nested for loops, iterating over the indices of factors to call get_corr for each position.

factor_length = len(factors)
corr_matr = np.identity(factor_length)
for i in range(factor_length):
    for j in range(factor_length):
        corr_matr[i,j] = corr_matr[j,i] = get_corr(factors[i], factors[j])

But I felt like there must be some syntactic NumPy sugar for this - is there? I don't reckon it can be any faster, but perhaps I'm wrong; nested for loops just don't seem necessary for this. I attempted to use np.frompyfunc and call it on the itertools.product, but that seems worse, because it calls get_corr twice as many times. Furthermore, I could not vectorize the function properly with the tuple sequence produced by itertools.product.
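A broadcast-based version of the frompyfunc idea would look something like the sketch below; it still evaluates every ordered pair, so get_corr runs about twice as often as necessary:

corr_ufunc = np.frompyfunc(get_corr, 2, 1)
arr = np.array(factors, dtype=object)
# broadcasting (N, 1) against (1, N) evaluates all N*N ordered pairs
corr_matr = corr_ufunc(arr[:, None], arr[None, :]).astype(float)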

Eric Hansen

2 Answers


To my knowledge, there is nothing in NumPy that will improve on this. NumPy is very fast once you have a numeric array; if what you have is a list of strings and a mapping function string -> double, then you will have to iterate over the entries.

One option would be to translate your mapping function to Cython and write the conversion there; that will probably speed things up.

If you want to stick to Python code, itertools has some interesting tools. As you mention, product probably won't give any improvement, as the mapping is symmetric and you would compute every off-diagonal value twice. However, combinations and combinations_with_replacement seem like good options to me.

If your correlation function always maps the autocorrelation to 1 (get_corr(i, i) == 1), use combinations, as it skips the diagonal elements; if not, use combinations_with_replacement.
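For a quick look at the difference between the two:

>>> from itertools import combinations, combinations_with_replacement
>>> list(combinations(range(3), 2))
[(0, 1), (0, 2), (1, 2)]
>>> list(combinations_with_replacement(range(3), 2))
[(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]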


Let me define a dummy correlation-like mapping function of strings -> double:

def get_corr(id1, id2):
    diff = len(id1) - len(id2)
    return 1. / (1. + diff * diff)

The function is symmetric and measures similarity (1 for strings of the same length, < 1 otherwise).

A helper function that generates random strings:

import string
from random import choice, randint

def random_strings(N, R):
    return [''.join(choice(string.ascii_uppercase + string.digits)
                    for _ in range(randint(1, R)))
            for _ in range(N)]

and a couple of test functions, yours:

import numpy as np

def test1(data):
    N = len(data)
    corr_matr = np.identity(N)
    for i in range(N):
        for j in range(N):
            corr_matr[i,j] = corr_matr[j,i] = get_corr(data[i], data[j])
    return corr_matr

and using combinations:

from itertools import combinations

def test2(data):
    N = len(data)
    corr_matr = np.identity(N)
    for (i, j) in combinations(range(N), 2):
        corr_matr[i,j] = corr_matr[j,i] = get_corr(data[i], data[j])
    return corr_matr
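
If the diagonal is not guaranteed to be 1, the same pattern works with combinations_with_replacement; a sketch (not included in the timings below) could be:

from itertools import combinations_with_replacement

def test3(data):
    N = len(data)
    corr_matr = np.empty((N, N))
    # i <= j covers every unordered pair once, including the diagonal
    for (i, j) in combinations_with_replacement(range(N), 2):
        corr_matr[i,j] = corr_matr[j,i] = get_corr(data[i], data[j])
    return corr_matr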

Now a bit of benchmarking with 100 random strings:

>>> data = random_strings(100, 10) # 100 random strings
>>> %timeit -n3 test1(data)
3 loops, best of 3: 5.24 ms per loop
>>> %timeit -n3 test2(data)
3 loops, best of 3: 2.29 ms per loop

And 1000 random strings:

>>> data = random_strings(1000, 10) # 1000 random strings
>>> %timeit -n3 test1(data)
3 loops, best of 3: 452 ms per loop
>>> %timeit -n3 test2(data)
3 loops, best of 3: 232 ms per loop

Using itertools (with a fairly simple mapping function) is roughly twice as fast.

Imanol Luengo

Your code iterates the full N*N times, and does the assignment twice.

corr_matr = np.identity(factor_length)
for i in range(factor_length):
    for j in range(factor_length):
        corr_matr[i,j] = corr_matr[j,i] = get_corr(factors[i], factors[j])

It would work just as well with corr_matr[i,j] = get_corr(factors[i], factors[j]) alone, since the [j,i] element is calculated on its own iteration anyway.

You could throw in a conditional, saving some get_corr calls:

if j>i:
    corr_matr[i,j] = corr_matr[j,i] = get_corr(factors[i], factors[j])
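
Spelled out, that's the same as starting the inner loop just past the diagonal:

corr_matr = np.identity(factor_length)
for i in range(factor_length):
    for j in range(i + 1, factor_length):
        corr_matr[i,j] = corr_matr[j,i] = get_corr(factors[i], factors[j])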

There's a set of np.tri* functions that give you indices for the upper or lower triangle of an array. np.tri is the base one used by the others; it returns an array of 1s and 0s:

In [169]: np.tri(4)
Out[169]: 
array([[ 1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  0.],
       [ 1.,  1.,  1.,  1.]])

With the where indices of this array, which is what np.tril_indices returns:

In [171]: np.tril_indices(4)
Out[171]: 
(array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3]),
 array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3]))

you could do a flattened iteration; with k=-1 to exclude the diagonal, it's the equivalent of the conditional above.

for i, j in zip(*np.tril_indices(4, -1)):
    print(i, j)
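
Applied to the correlation matrix, that might look like:

n = len(factors)
corr_matr = np.identity(n)
# k=-1 keeps only the strict lower triangle, skipping the diagonal
for i, j in zip(*np.tril_indices(n, -1)):
    corr_matr[i,j] = corr_matr[j,i] = get_corr(factors[i], factors[j])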

There are SO questions that discuss filling in a symmetric array given the upper (or lower) triangle values as a flat array.

Simply summing a triangular array with its transpose works, though it may require correcting the doubled diagonal:

In [175]: x=np.tri(4)

In [176]: x+x.T*2
Out[176]: 
array([[ 3.,  2.,  2.,  2.],
       [ 1.,  3.,  2.,  2.],
       [ 1.,  1.,  3.,  2.],
       [ 1.,  1.,  1.,  3.]])

If you have to apply the get_corr function to (scalar) pairs anyway, your double assignment is probably as fast as, if not faster than, this sort of triangle-copying after the fact. But that can be timed.
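
As a sketch of that after-the-fact approach (assuming the diagonal should be all ones, as np.identity implies):

n = len(factors)
upper = np.zeros((n, n))
# fill only the strict upper triangle
for i, j in zip(*np.triu_indices(n, 1)):
    upper[i,j] = get_corr(factors[i], factors[j])
# mirror it into the lower triangle, then set the diagonal
corr_matr = upper + upper.T
np.fill_diagonal(corr_matr, 1.0)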

hpaulj