17

I have a huge image dataset that does not fit in memory. I want to compute the mean and standard deviation, loading images from disk.

I'm currently trying to use this algorithm, found on Wikipedia (Welford's online algorithm for computing the variance).

# For a new value newValue, compute the new count, new mean, and the new M2.
# mean accumulates the mean of the entire dataset.
# M2 aggregates the squared distance from the mean.
# count aggregates the number of samples seen so far.
def update(existingAggregate, newValue):
    (count, mean, M2) = existingAggregate
    count = count + 1
    delta = newValue - mean
    mean = mean + delta / count
    delta2 = newValue - mean
    M2 = M2 + delta * delta2

    return (count, mean, M2)

# Retrieve the mean and variance from an aggregate.
def finalize(existingAggregate):
    (count, mean, M2) = existingAggregate
    if count < 2:
        return float('nan')
    variance = M2 / (count - 1)
    return (mean, variance)

This is my current implementation (computing just for the red channel):

count = 0
mean = 0
delta = 0
delta2 = 0
M2 = 0
for file in tqdm(first):
    image = cv2.imread(file)  # note: cv2.imread returns channels in BGR order
    for i in range(224):
        for j in range(224):
            b, g, r = image[i, j, :]
            newValue = r
            count = count + 1
            delta = newValue - mean
            mean = mean + delta / count
            delta2 = newValue - mean
            M2 = M2 + delta * delta2

print('first mean', mean)
print('first std', np.sqrt(M2 / (count - 1)))

This implementation gives results that are close enough on the subset of the dataset I tried it on.

The problem is that it is extremely slow and therefore nonviable.

  • Is there a standard way of doing this?

  • How can I adapt this for faster results, or compute the RGB mean and standard deviation for the whole dataset without loading it all into memory at the same time and at a reasonable speed?

Bruno Klein
  • Hello Bruno... it sounds like multiprocessing is your solution. In the end you need to load each image into memory eventually, but with multiprocessing you can split the workload: split the file list into chunks that can be processed independently and thus won't drain your memory completely. Use one thread to set up the work and then spin out (depending on your cores) 4-7 worker processes at a time over 4-7 cores. If they finish fast you may double the number of workers per core. Think of Python's threading example, ThreadedTCPServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer) (a rough sketch follows these comments). – ZF007 Dec 17 '17 at 00:23
  • https://stackoverflow.com/a/20888445/8928024 – ZF007 Dec 17 '17 at 00:28
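
A rough sketch of the multiprocessing idea from the comment above, with per-image statistics computed in worker processes and combined afterwards (the function and variable names, and the path pattern, are illustrative and not from the question):

import glob
import multiprocessing as mp

import cv2
import numpy as np

def image_stats(path):
    """Per-image pixel count, per-channel sum and sum of squares (BGR order)."""
    img = cv2.imread(path).astype(np.float64)
    n = img.shape[0] * img.shape[1]
    return n, img.sum(axis=(0, 1)), np.square(img).sum(axis=(0, 1))

if __name__ == '__main__':
    files = glob.glob('dataset/*.jpg')   # illustrative path pattern
    with mp.Pool(processes=4) as pool:   # roughly one worker per core
        results = pool.map(image_stats, files)
    n = sum(r[0] for r in results)
    channel_sum = sum(r[1] for r in results)
    channel_sos = sum(r[2] for r in results)
    mean = channel_sum / n
    std = np.sqrt(channel_sos / n - mean ** 2)
    print('BGR mean:', mean, 'BGR std:', std)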

3 Answers

10

Since this is a numerically heavy task (a lot of iterations over a matrix, or a tensor), I always suggest using libraries that are good at this: numpy.

A properly installed numpy should be able to use the underlying BLAS (Basic Linear Algebra Subprograms) routines, which are optimized for operating on arrays of floating-point numbers from a memory-hierarchy perspective.

imread should already give you a numpy array. Note that cv2.imread returns channels in BGR order, so the red channel is index 2. You can get a flattened 1-D array of the red channel by

import numpy as np
val = np.reshape(image[:, :, 2], -1)

the mean of that array by

np.mean(val)

and the standard deviation by

np.std(val)

In this way, you can get rid of two layers of python loops:

count = 0
mean = 0
delta = 0
delta2 = 0
M2 = 0
for i, file in enumerate(tqdm(first)):
    image = cv2.imread(file)
    val = np.reshape(image[:, :, 2], -1)  # red channel (BGR order)
    img_mean = np.mean(val)
    img_std = np.std(val)
    ...

The rest of the incremental update should be straightforward.
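
For completeness, here is one way that incremental step could look. This is only a sketch: it uses the per-image numpy statistics above and the pairwise combination formula (Chan et al.) from the same Wikipedia page, with `first` being the file list from the question:

import cv2
import numpy as np
from tqdm import tqdm

count, mean, M2 = 0, 0.0, 0.0
for file in tqdm(first):                      # `first` is the file list from the question
    image = cv2.imread(file)
    val = np.reshape(image[:, :, 2], -1).astype(np.float64)  # red channel (BGR order)
    n_b = val.size
    mean_b = val.mean()
    M2_b = ((val - mean_b) ** 2).sum()        # per-image sum of squared deviations
    # merge the per-image aggregate into the running (count, mean, M2)
    delta = mean_b - mean
    total = count + n_b
    mean += delta * n_b / total
    M2 += M2_b + delta ** 2 * count * n_b / total
    count = total

print('red mean', mean)
print('red std', np.sqrt(M2 / (count - 1)))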

Once you have done this, the bottleneck becomes the image-loading speed, which is limited by disk read performance. In that regard, based on my prior experience, I suspect that using multiple threads, as others suggested, will help a good deal.

Yo Hsiao
  • But using this approach, how can I get the overall `std` of the dataset? I need the individual channel values for that. – Bruno Klein Dec 17 '17 at 01:25
  • @BrunoKlein The one you quoted above is a numerically stable method, which is great. But you can also go for a more naive one if it won't overflow: the mean of the array of per-image means is the overall mean. The std is trickier. If you are not after a world-class implementation, you can store the sum of squares (sos) for each image and incrementally sum them up. Once that is done, compute `sqrt(sum_sos/N - mean**2)` (you will want to work out what `N` is). Note that if `sum_sos/N` is too close to `mean**2`, you may run into numerical issues (see the sketch after these comments). – Yo Hsiao Dec 17 '17 at 01:46
  • @YoHsiao this comment is so important that it is worth being added to the answer :) – Shaohua Li Jan 19 '19 at 04:34
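
A minimal sketch of that naive sum-of-squares approach for the red channel (the `files` list and path pattern are illustrative, not from the question):

import glob

import cv2
import numpy as np

files = glob.glob('dataset/*.jpg')   # illustrative path pattern
sum_px = 0.0     # running sum of red-channel pixel values
sum_sos = 0.0    # running sum of squared red-channel pixel values
n_pixels = 0     # this is the `N` in the formula above

for file in files:
    val = cv2.imread(file)[:, :, 2].astype(np.float64).ravel()  # red channel (BGR order)
    sum_px += val.sum()
    sum_sos += np.square(val).sum()
    n_pixels += val.size

mean = sum_px / n_pixels
# may lose precision when sum_sos / n_pixels is very close to mean ** 2
std = np.sqrt(sum_sos / n_pixels - mean ** 2)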
5

You can also use OpenCV's method cv2.meanStdDev:

cv2.meanStdDev(src[, mean[, stddev[, mask]]]) → mean, stddev
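
For example, per image (a sketch only; the per-image results still have to be combined across the dataset with one of the aggregation schemes above, and the file name is illustrative):

import cv2

image = cv2.imread('some_image.png')    # illustrative file name
mean, stddev = cv2.meanStdDev(image)    # one row per channel, in BGR order
print(mean.ravel(), stddev.ravel())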
Andrey Smorodov
1

If you don't want to put everything into memory in an array containing your entire dataset, you can just compute it iteratively:

# the global mean of the dataset; in practice compute it in a first pass
# over the data (the value here is just made up)
global_mean = 134.0
# per-pixel array with the values (x_i - mu) ** 2 / |x|
sums = ((images[0] - global_mean) ** 2) / len(images)

for img in images[1:]:
    sums = sums + ((img - global_mean) ** 2) / len(images)

# Take the mean of all per-pixel variances, then the square root to get the std
dataset_std = np.sqrt(np.mean(sums))
Greg