How to vectorize a 'for' loop which calls a function (that takes a 2-Dimensional array as argument) over a 3-Dimensional numpy array

Question

I have a numpy array containing the XYZ coordinates of the k-neighboors (k=10) points from a point cloud:

k_neighboors
Out[53]: 
array([[[  2.51508147e-01,   5.60274944e-02,   1.98303187e+00],
        [  2.48552352e-01,   5.95569573e-02,   1.98319519e+00],
        [  2.56611764e-01,   5.36767729e-02,   1.98236740e+00],
        ..., 
        [  2.54520357e-01,   6.23480231e-02,   1.98255634e+00],
        [  2.57603496e-01,   5.19787706e-02,   1.98221457e+00],
        [  2.43914440e-01,   5.68424985e-02,   1.98352253e+00]],

       [[  9.72352773e-02,   2.06699912e-02,   1.99344850e+00],
        [  9.91205871e-02,   2.36056261e-02,   1.99329960e+00],
        [  9.59625840e-02,   1.71508361e-02,   1.99356234e+00],
        ..., 
        [  1.03216261e-01,   2.19752081e-02,   1.99304521e+00],
        [  9.65025574e-02,   1.44127617e-02,   1.99355054e+00],
        [  9.59930867e-02,   2.72080526e-02,   1.99344873e+00]],

       [[  1.76408485e-01,   2.81930678e-02,   1.98819435e+00],
        [  1.78670138e-01,   2.81904750e-02,   1.98804617e+00],
        [  1.80372953e-01,   3.05109434e-02,   1.98791444e+00],
        ..., 
        [  1.81960404e-01,   2.47725621e-02,   1.98785996e+00],
        [  1.74499243e-01,   3.50728296e-02,   1.98826015e+00],
        [  1.83470801e-01,   2.70808022e-02,   1.98774099e+00]],

       ..., 
       [[  1.78178743e-01,  -4.60980982e-02,  -1.98792374e+00],
        [  1.77953839e-01,  -4.73701134e-02,  -1.98792756e+00],
        [  1.77889392e-01,  -4.75468598e-02,  -1.98793030e+00],
        ..., 
        [  1.79924294e-01,  -5.08776568e-02,  -1.98772371e+00],
        [  1.76720902e-01,  -5.11409082e-02,  -1.98791265e+00],
        [  1.83644593e-01,  -4.64747548e-02,  -1.98756230e+00]],

       [[  2.00245917e-01,  -2.33091787e-03,  -1.98685515e+00],
        [  2.02384919e-01,  -5.60011715e-04,  -1.98673022e+00],
        [  1.97325528e-01,  -1.03301927e-03,  -1.98705769e+00],
        ..., 
        [  1.95464164e-01,  -6.23105839e-03,  -1.98713481e+00],
        [  1.98985338e-01,  -8.39920342e-03,  -1.98688531e+00],
        [  1.95959195e-01,   2.68006674e-03,  -1.98713303e+00]],

       [[  1.28851235e-01,  -3.24527062e-02,  -1.99127460e+00],
        [  1.26415789e-01,  -3.27731185e-02,  -1.99143147e+00],
        [  1.25985757e-01,  -3.24910432e-02,  -1.99146211e+00],
        ..., 
        [  1.28296465e-01,  -3.92388329e-02,  -1.99117136e+00],
        [  1.34895295e-01,  -3.64872888e-02,  -1.99083793e+00],
        [  1.29047096e-01,  -3.97952795e-02,  -1.99111152e+00]]])

With this shape:

k_neighboors.shape
Out[54]: (2999986, 10, 3)

And I have this function which applies a Principal Component Analysis to some data provided as 2-Dimensional array:

def PCA(data, correlation=False, sort=True):
    """ Applies Principal Component Analysis to the data
    
    Parameters
    ----------        
    data: array
        The array containing the data. The array must have NxM dimensions, where each
        of the N rows represents a different individual record and each of the M columns
        represents a different variable recorded for that individual record.
            array([
            [V11, ... , V1m],
            ...,
            [Vn1, ... , Vnm]])
    
    correlation(Optional) : bool
            Set the type of matrix to be computed (see Notes):
                If True compute the correlation matrix.
                If False(Default) compute the covariance matrix. 
                
    sort(Optional) : bool
            Set the order that the eigenvalues/vectors will have
                If True(Default) they will be sorted (from higher value to less).
                If False they won't.   
    Returns
    -------
    eigenvalues: (1,M) array
        The eigenvalues of the corresponding matrix.
        
    eigenvector: (M,M) array
        The eigenvectors of the corresponding matrix.
    
    Notes
    -----
    The correlation matrix is a better choice when there are different magnitudes
    representing the M variables. Use covariance matrix in any other case.
    
    """
    
    #: get the mean of all variables
    mean = np.mean(data, axis=0, dtype=np.float64)
    
    #: adjust the data by substracting the mean to each variable
    data_adjust = data - mean
    
    #: compute the covariance/correlation matrix
    #: the data is transposed due to np.cov/corrcoef sintaxis
    if correlation:
        matrix = np.corrcoef(data_adjust.T)
    else:
        matrix = np.cov(data_adjust.T) 
    
    #: get the eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(matrix)
    
    if sort:
        #: sort eigenvalues and eigenvectors
        sort = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[sort]
        eigenvectors = eigenvectors[:,sort]
    
    return eigenvalues, eigenvectors

So the question is: how can I apply the PCA function mentioned above over each of the 2999986 10x3 arrays in a way that doesn't take for ever like this one:

data = np.empty((2999986, 3))

for i in range(len(k_neighboors)):
    w, v = PCA(k_neighboors[i])
    data[i] = v[:,2]
    break #:   I break the loop in order to don't have to wait for ever.


data
Out[64]: 
array([[ 0.10530792,  0.01028906,  0.99438643],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       ..., 
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]])

[`Corrcoef vectorizing method`](http://stackoverflow.com/a/30143754/3293881). — Divakar, Apr 04 '16 at 19:32
Very close, if not a duplicate : [`Quickly compute eigenvectors for each element of an array in python`](http://stackoverflow.com/questions/35756952/quickly-compute-eigenvectors-for-each-element-of-an-array-in-python). Thus, mixing the earlier link with this should get you close, if not home. — Divakar, Apr 04 '16 at 19:34
note that np.linalg.eig accepts [..., m, n] as input shapes; that is, you can vectorize this call over all submatrices. and that's true for all the steps in your PCA function; each has a vectorized equivalent; you just need to write it out :) — Eelco Hoogendoorn, Apr 04 '16 at 19:37
Thaks for the comment @Dikavar, I found your np.einsum function quite awesome, thought the fact of understanding it exceed my poor programming knowledge. I end up with an (assumed) answer to my own question using your function, but I don't know if it's the right one. I would like to hear your opinion about my answer. — David de la Iglesia, Apr 05 '16 at 10:00
Thanks for pointing that @Eelco, I use it on my (assumed)answer. I would like to know your opinion about it(the answer). — David de la Iglesia, Apr 05 '16 at 10:03

score 0 · Accepted Answer · edited May 23 '17 at 11:53

Thanks to @Divakar and @Eelco comments.

Using the function that Divakar post on this answer

 def vectorized_app(data):            
        diffs = data - data.mean(1,keepdims=True)
        return np.einsum('ijk,ijl->ikl',diffs,diffs)/data.shape[1]

And using what Eelco pointed on his comment, I end up with this.

k_neighboors.shape
Out[48]: (2999986, 10, 3)

#: THE (ASSUMED)VECTORIZED ANSWER
data = np.linalg.eig(vectorized_app(k_neighboors))[1][:,:,2]

data
Out[50]: 
array([[ 0.10530792,  0.01028906,  0.99438643],
       [ 0.06462   ,  0.00944352,  0.99786526],
       [ 0.0654035 ,  0.00860751,  0.99782177],
       ..., 
       [-0.0632175 ,  0.01613551,  0.99786933],
       [-0.06449399,  0.00552943,  0.99790278],
       [-0.06081954,  0.01802078,  0.99798609]])

Wich gives the same results as the for loop, without taking forever (althought still takes a while):

data2 = np.empty((2999986, 3))

for i in range(len(k_neighboors)):
    if i > 10:
        break #:   I break the loop in order to don't have to wait for ever.
    w, v = PCA(k_neighboors[i])
    data2[i] = v[:,2]


data2
Out[52]: 
array([[ 0.10530792,  0.01028906,  0.99438643],
       [ 0.06462   ,  0.00944352,  0.99786526],
       [ 0.0654035 ,  0.00860751,  0.99782177],
       ..., 
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]])

I don't know if there could be a better way to do this, so I'm going to keep the question open.

looks good to me; I'm pretty sure this is your best bet. PCA isn't a cheap operation to begin with, so I would expect doing it 3M times to take some time. — Eelco Hoogendoorn, Apr 05 '16 at 11:08

How to vectorize a 'for' loop which calls a function (that takes a 2-Dimensional array as argument) over a 3-Dimensional numpy array

1 Answers1