
I already asked a similar question, which got answered, but this one goes into more detail:

I need a really fast way to get all the important component stats of two arrays, where one array is thresholded and labeled by opencv2 and the resulting components provide the masks for both arrays. The stats for all components, masked onto the two arrays, should then be saved to a dictionary. My approach works, but it is much too slow. Is there a way to avoid the loop, or a better approach than ndimage.labeled_comprehension?

from scipy import ndimage
import numpy as np
import cv2

def calculateMeanMaxMin(val):
    return np.array([np.mean(val), np.max(val), np.min(val)])

def getTheStatsForComponents(array1, array2):
    # threshold array2 and label its connected components
    ret, thresholded = cv2.threshold(array2, 120, 255, cv2.THRESH_BINARY)
    thresholded = thresholded.astype(np.uint8)
    numLabels, labels, stats, centroids = cv2.connectedComponentsWithStats(thresholded, 8, cv2.CV_32S)
    allComponentStats = []
    # numLabels includes the background (label 0), so valid component labels are 1..numLabels-1
    meanmaxminArray2 = ndimage.labeled_comprehension(array2, labels, np.arange(1, numLabels), calculateMeanMaxMin, np.ndarray, 0)
    meanmaxminArray1 = ndimage.labeled_comprehension(array1, labels, np.arange(1, numLabels), calculateMeanMaxMin, np.ndarray, 0)
    for position, label in enumerate(range(1, numLabels)):
        currentLabel = np.uint8(labels == label)
        contour, _ = cv2.findContours(currentLabel, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
        (side1, side2) = cv2.minAreaRect(contour[0])[1]
        componentStat = stats[label]
        allstats = {'position': centroids[label, :], 'area': componentStat[4], 'height': componentStat[3],
                    'width': componentStat[2], 'meanArray1': meanmaxminArray1[position][0], 'maxArray1': meanmaxminArray1[position][1],
                    'minArray1': meanmaxminArray1[position][2], 'meanArray2': meanmaxminArray2[position][0], 'maxArray2': meanmaxminArray2[position][1],
                    'minArray2': meanmaxminArray2[position][2]}

        if side1 >= side2 and side1 > 0:
            allstats['elongation'] = np.float32(side2 / side1)
        elif side2 > side1 and side2 > 0:
            allstats['elongation'] = np.float32(side1 / side2)
        else:
            allstats['elongation'] = np.float32(0)
        allComponentStats.append(allstats)
    return allComponentStats

EDIT

The two arrays are 2d arrays:

array1= np.random.choice(255,(512,512)).astype(np.uint8)
array2= np.random.choice(255,(512,512)).astype(np.uint8)
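
For completeness, the function above would then simply be called on these two arrays (just a usage sketch):

allComponentStats = getTheStatsForComponents(array1, array2)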

EDIT2

A small example with two arrays and a labelArray with two components (1 and 2, plus background 0). Calculate the min, max, and mean with ndimage.labeled_comprehension:

from scipy import ndimage
import numpy as np

labelArray = np.array([[0,1,1,1],[2,2,1,1],[2,2,0,1]])
data = np.array([[0.1,0.2,0.99,0.2],[0.34,0.43,0.87,0.33],[0.22,0.53,0.1,0.456]])
data2 = np.array([[0.1,0.2,0.99,0.2],[0.1,0.2,0.99,0.2],[0.1,0.2,0.99,0.2]])
numLabels = 2

minimumDataForAllLabels = ndimage.labeled_comprehension(data, labelArray, np.arange(1, numLabels+1), np.min, np.ndarray, 0)
minimumData2ForAllLabels = ndimage.labeled_comprehension(data2, labelArray, np.arange(1, numLabels+1), np.min, np.ndarray, 0)
print(minimumDataForAllLabels)
print(minimumData2ForAllLabels)
# bin_and_do_simple_stats is defined in the answer below; it returns (minima, means)
print(bin_and_do_simple_stats(labelArray.flatten(), data.flatten())[0])

Output:

[0.2 0.22]       # minima of components 1 and 2 from data
[0.2 0.1]        # minima of components 1 and 2 from data2
[0.1  0.2  0.22] # minima from bin_and_do_simple_stats on data; the background bin (label 0) is included

1 Answer


labeled_comprehension is definitely slow.

At least the simple stats can be computed much faster using the approach from the linked post. For simplicity I'm only doing one data array here, but since the procedure returns the sort indices, it can easily be extended to multiple arrays:

import numpy as np
from scipy import sparse
try:
    from stb_pthr import sort_to_bins as _stb_pthr
    HAVE_PYTHRAN = True
except ImportError:
    HAVE_PYTHRAN = False

# fallback if pythran is not available

def sort_to_bins_sparse(idx, data, mx=-1):
    if mx == -1:
        mx = idx.max() + 1
    # building the matrix in CSR order and converting to CSC sorts the entries by column (label)
    aux = sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc()
    return aux.data, aux.indices, aux.indptr

def sort_to_bins_pythran(idx, data, mx=-1):
    indices, indptr = _stb_pthr(idx, mx)
    return data[indices], indices, indptr

# pick the best available implementation

sort_to_bins = sort_to_bins_pythran if HAVE_PYTHRAN else sort_to_bins_sparse

# example data

idx = np.random.randint(0, 10, (100000,))
data = np.random.random(100000)

# if possible, check that the two methods agree

if HAVE_PYTHRAN:
    dsp, isp, psp = sort_to_bins_sparse(idx, data)
    dph, iph, pph = sort_to_bins_pythran(idx, data)

    assert (dsp == dph).all()
    assert (isp == iph).all()
    assert (psp == pph).all()

# example of simple vectorized per-bin statistics
# note: the reduceat pattern assumes every bin is non-empty

def simple_stats(data, iptr):
    minima = np.minimum.reduceat(data, iptr[:-1])
    means = np.add.reduceat(data, iptr[:-1]) / np.diff(iptr)
    return minima, means

def bin_and_do_simple_stats(idx, data, mx=-1):
    data, indices, indptr = sort_to_bins(idx, data, mx)
    return simple_stats(data, indptr)

print("minima: {}\n mean values: {}".format(*bin_and_do_simple_stats(idx, data)))

If you have pythran (not required, but a bit faster), save the following as stb_pthr.py and compile it (e.g. with pythran stb_pthr.py):

import numpy as np

#pythran export sort_to_bins(int[:], int)

def sort_to_bins(idx, mx):
    # counting sort: returns the positions that sort idx, plus the bin boundaries
    if mx == -1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    # count occurrences of each label (offset by 2 so boundaries can be built in place)
    for i in range(idx.size):
        cnts[idx[i]+2] += 1
    # cumulative sum gives the start offset of each bin
    for i in range(2, cnts.size):
        cnts[i] += cnts[i-1]
    # scatter each position into its bin
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return res, cnts[:-1]
  • Thank you in advance, I will test it. But what I have to add is that the arrays are 2d arrays, where one of the arrays is thresholded and the components are calculated with opencv2. So for every component I get 6 values for the standard stats (min, mean, max), 3 for each array. – Varlor Aug 14 '19 at 20:55
  • @Varlor both methods can only handle 1d arrays, so you'll have to ravel them. – Paul Panzer Aug 14 '19 at 21:08
  • So idx in your code is the 2d label array (matrix) and I also have to flatten it? And I have to execute it two times, like in Edit2? – Varlor Aug 14 '19 at 21:14
  • @Varlor yes, you have to flatten idx; no, you only have to call the function once, because indices2 = indices, indptr2 = indptr, data2 = flattenedData2[indices] (see the sketch after these comments) – Paul Panzer Aug 14 '19 at 22:09
  • OK, it took me a while to understand the output and sparse matrices. But I have some questions left. 1. Why first create a csr matrix and convert it to csc, instead of creating the csc directly? 2. I do not really understand how the flattened label array can act as the indices of the sparse matrix. 3. Is it right, according to your posted link, that for only a few components, say 0 to 100, argsort would be a better solution? – Varlor Aug 15 '19 at 13:14
  • @Varlor 1) + 2) what we are doing is creating a matrix that has as many rows as you have pixels and as many columns as you have labels; it is mostly zero but has ones for each pixel-label pair. For efficient storage and speed we use a sparse format which omits all zeros. In csr the nonzeros are stored row by row, so labels are listed in pixel order, the same order as they occur originally, which is why we can create the csr from the original idx. csc storage is column by column, so pixels are listed in label order, which is why after conversion `indices` has flattened pixel offsets grouped by ... – Paul Panzer Aug 16 '19 at 01:29
  • ... label. 3) You'd have to test. Yes, generally for smaller problem sizes argsort may be quicker, but the plot is for a fixed number of labels (100); the x-axis is the number of pixels. – Paul Panzer Aug 16 '19 at 01:30
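
To make the comment thread concrete, here is a minimal sketch (my summary of Paul Panzer's comments above, not part of the original answer) of calling the binning once and reusing the sort indices for both 2d arrays from the question:

labelsFlat = labels.ravel()                 # 2d label array from cv2.connectedComponentsWithStats
data1, indices, indptr = sort_to_bins(labelsFlat, array1.ravel())
data2 = array2.ravel()[indices]             # reuse the same sort indices for the second array
min1, mean1 = simple_stats(data1, indptr)   # note: bin 0 is the background
min2, mean2 = simple_stats(data2, indptr)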