
I am working with the affinity propagation algorithm and I want to write it from scratch without using scikit-learn. I have implemented the responsibility and availability matrices with nested for loops / list comprehensions, but each one takes more than 30 minutes to run on data with more than 2000 individuals.

from scipy.spatial.distance import euclidean, pdist, squareform
import numpy as np
import pandas as pd

def similarity_func(u, v):
    return -euclidean(u, v)   # similarity = negative Euclidean distance

csv_data = pd.read_csv("DadosC.csv", delimiter=",", encoding="utf8", engine="python")
X = csv_data[["InicialL", "FinalL"]].to_numpy()   # a set indexer is unordered; use a list

# Pairwise similarities s(i, j) = -||x_i - x_j||. Note that a Python callback in
# pdist is itself slow; the equivalent -squareform(pdist(X, 'euclidean')) is faster.
dists = pdist(X, similarity_func)
distM = squareform(dists)                  # squareform already returns an ndarray
np.fill_diagonal(distM, np.median(distM))  # preferences: median of the similarities

A = np.zeros((X.shape[0], X.shape[0]))    # availabilities, initialized to zero

def Respo(A, S, i, j):
    a_0 = np.delete(A[i], j)              # availabilities a(i, k') for all k' != j
    s_0 = np.delete(S[i], j)              # similarities  s(i, k') for all k' != j
    return S[i][j] - np.max(a_0 + s_0)    # r(i, j) = s(i, j) - max_{k' != j}(a(i, k') + s(i, k'))

Lis = [[Respo(A, distM, i, j) for i in range(X.shape[0])] for j in range(X.shape[0])]
Res = np.reshape(Lis, (X.shape[0], X.shape[0])).T

This is what I have. A and S are 2000x2000 arrays; A is initialized to zeros but is then updated with a similar function. When X is a 2000x2 array this takes too long to compute. What alternative can you think of?
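For reference, the whole responsibility update can be expressed with NumPy broadcasting instead of a Python-level double loop. A minimal vectorized sketch, assuming the standard update r(i, j) = s(i, j) - max_{k' != j}(a(i, k') + s(i, k')) and reusing the A and distM arrays from above: for each row, the excluded-column maximum equals the row maximum everywhere except at the argmax column, where it equals the second-largest value.

n = distM.shape[0]
AS = A + distM
rows = np.arange(n)
idx = np.argmax(AS, axis=1)                  # column of each row's maximum
first = AS[rows, idx]                        # largest A + S per row (fancy indexing copies)
AS[rows, idx] = -np.inf                      # mask the maximum ...
second = AS.max(axis=1)                      # ... to expose the second largest
Res = distM - first[:, None]                 # subtract the row maximum everywhere
Res[rows, idx] = distM[rows, idx] - second   # except at the argmax column itself

This replaces the four million Python-level Respo calls with a handful of whole-array operations and should take well under a second on a 2000x2000 matrix.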

Nelson

3 Answers


SciKit-Learn is largely a wrapper around compiled C libraries, which allows for multi-threading and faster execution speed in general. It also has many built-in optimizations...

If you are trying to compete with the speed of SciKit-Learn, you will probably need to code in C instead of Python.

spencer.pinegar

Python is not designed with execution performance in mind, and the vast majority of SciKit is a wrapper over highly mature C modules. While the intent is admirable, I would recommend becoming more familiar with how this library works, in particular how it transforms Python data types into their C equivalents.

To compete with the library's speed, you would first want to understand how it calls and organizes its data structures, and then you would have to try to improve upon them. Not impossible, but highly unlikely, given this quote from the FAQ page:

We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.

  • Python by itself is certainly never going to give you comparable performance, but often you can get somewhere close using `numpy` and `scipy` for your numerical array manipulation. – juanpa.arrivillaga Aug 11 '21 at 18:32
  • For sure, this certainly wasn't a knock against Python (I've used it for heavy RF computational analysis) but rather just a general statement about the performance of the language. I feel that for use cases like the OP's, `numpy` and `scipy` are completely valid and often preferable if you aren't as familiar with lower-level languages like C. – ProgrammerPterosaur Aug 12 '21 at 03:48
  • Oh sure. I would also add, `numba` has really changed the game if you want to write performant numerical code in just Python without having to write a C extension (see the sketch below). Lots of things aren't amenable to vectorized operations, especially if you want to maintain some space efficiency. – juanpa.arrivillaga Aug 12 '21 at 03:52
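Following up on the `numba` pointer: below is a minimal sketch of the question's responsibility loop compiled with `@njit`. This is an illustration, not tested against the question's data; it assumes S and A are plain float arrays and that `numba` is installed separately.

from numba import njit
import numpy as np

@njit
def responsibilities(S, A):
    # Loop form of R[i, j] = S[i, j] - max_{k != j}(A[i, k] + S[i, k]),
    # compiled to machine code by numba.
    n = S.shape[0]
    R = np.empty((n, n))
    for i in range(n):
        # Find the largest and second-largest A[i, k] + S[i, k] once per row,
        # so excluding column j later costs O(1) instead of O(n).
        first = -np.inf
        second = -np.inf
        arg = -1
        for k in range(n):
            v = A[i, k] + S[i, k]
            if v > first:
                second = first
                first = v
                arg = k
            elif v > second:
                second = v
        for j in range(n):
            R[i, j] = S[i, j] - (second if j == arg else first)
    return R

Called as responsibilities(distM, A); the first call pays a one-off compilation cost, after which the loop should run at roughly machine speed.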

You can try to use the built-in map function. I would be able to demonstrate further had you posted 'A' and 'distM', but feel free to look into the function: https://docs.python.org/3/library/functions.html#map Generally, whenever map is applicable it is preferred, since looping in Python isn't as fast.
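For illustration, here is the question's nested comprehension rewritten with map, reusing the Respo, A, and distM names from the question (as the comments below point out, any speedup is marginal, since the runtime is dominated by the Respo calls themselves):

from itertools import product

n = X.shape[0]
pairs = product(range(n), range(n))   # (j, i) pairs in the same order as the nested loops
Lis = list(map(lambda ji: Respo(A, distM, ji[1], ji[0]), pairs))
Res = np.reshape(Lis, (n, n)).T       # Res[i, j] = Respo(A, distM, i, j)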

Ofek Glick
  • `map` is generally not faster than a loop; it essentially is a Python-level for loop. And generally, list comprehensions are preferred over `map`; they are often faster if you avoid the function call, but regardless, the speed differences would be marginal. – juanpa.arrivillaga Aug 11 '21 at 17:13
  • You intrigued me, as I was sure map was generally faster than a list comprehension. Anyway, as answered here: https://stackoverflow.com/questions/1247486/list-comprehension-vs-map map may have a slight advantage when not using lambda functions (as he is planning to). In any case, as the question was how to avoid loops and list comprehensions, I would try to make map work. – Ofek Glick Aug 12 '21 at 21:19
  • Essentially, the issue is that if you do `[f(x) for x in whatever]`, `f` is looked up every time; this is particularly costly if `f` is a global/built-in (equivalent to a couple of hash lookups). It should be faster if it is a local lookup. But again, all of these are very minor differences, and usually a list comprehension wins out when you are using some expression `[<expr> for x in whatever]` compared to `map(lambda x: <expr>, whatever)`, because in this case you are avoiding the function-call overhead. – juanpa.arrivillaga Aug 12 '21 at 21:33
  • But these are all minor variations. The point is, for-loops, list comprehensions, and `map` are all essentially doing a Python-level for-loop and will be of comparable speed; the primary difference is *stylistic* (the quick timing check below bears this out). – juanpa.arrivillaga Aug 12 '21 at 21:34
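A quick way to check these claims on your own machine, using only the standard library (absolute numbers will vary; the point is how close the three variants land):

import timeit

setup = "xs = list(range(10_000))"
# comprehension with an inline expression (no per-element function call)
print(timeit.timeit("[x + 1 for x in xs]", setup=setup, number=1000))
# map with a lambda (one Python-level call per element)
print(timeit.timeit("list(map(lambda x: x + 1, xs))", setup=setup, number=1000))
# map with a built-in (no Python-level frame per call)
print(timeit.timeit("list(map(abs, xs))", setup=setup, number=1000))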