I have a huge image dataset that does not fit in memory. I want to compute the mean and standard deviation, loading images from disk.
I'm currently trying to use this algorithm, found on Wikipedia:
    # For a new value newValue, compute the new count, new mean, and new M2.
    # mean accumulates the mean of the entire dataset
    # M2 aggregates the squared distance from the mean
    # count aggregates the number of samples seen so far
    def update(existingAggregate, newValue):
        (count, mean, M2) = existingAggregate
        count = count + 1
        delta = newValue - mean
        mean = mean + delta / count
        delta2 = newValue - mean
        M2 = M2 + delta * delta2
        # return the updated values, not the unchanged input tuple
        return (count, mean, M2)

    # Retrieve the mean and sample variance from an aggregate
    def finalize(existingAggregate):
        (count, mean, M2) = existingAggregate
        # check the count first so we never divide by zero
        if count < 2:
            return float('nan')
        else:
            (mean, variance) = (mean, M2 / (count - 1))
            return (mean, variance)
This is my current implementation (computing just for the red channel):
    import cv2
    import numpy as np
    from tqdm import tqdm

    # `first` is a list of image file paths (a subset of the dataset)
    count = 0
    mean = 0.0
    M2 = 0.0
    for file in tqdm(first):
        image = cv2.imread(file)  # OpenCV returns channels in BGR order
        for i in range(224):
            for j in range(224):
                b, g, r = image[i, j, :]
                newValue = float(r)  # avoid uint8 arithmetic
                count = count + 1
                delta = newValue - mean
                mean = mean + delta / count
                delta2 = newValue - mean
                M2 = M2 + delta * delta2
    print('first mean', mean)
    print('first std', np.sqrt(M2 / (count - 1)))
On the subset of the dataset I tried, this implementation gives results that are close enough.
The problem is that it is extremely slow, which makes it unusable on the full dataset.
Is there a standard way of doing this?
How can I adapt this to run faster, or otherwise compute the per-channel RGB mean and standard deviation over the whole dataset at a reasonable speed, without loading it all into memory at once?
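One direction that might help, as a rough sketch rather than a definitive answer: instead of updating the aggregate once per pixel in Python, compute each image's per-channel count, mean and M2 with NumPy and merge them into the running aggregate using the parallel (Chan et al.) variant described on the same Wikipedia page. The names `merge_batch` and `channel_stats` are hypothetical, and the sketch assumes every image is an array of shape (H, W, C), such as the arrays returned by `cv2.imread`.

```python
import numpy as np


def merge_batch(aggregate, pixels):
    """Merge a batch of pixels into a running (count, mean, M2) aggregate.

    pixels has shape (n, channels); mean and M2 are per-channel arrays.
    """
    count_a, mean_a, m2_a = aggregate
    pixels = np.asarray(pixels, dtype=np.float64)
    count_b = pixels.shape[0]
    if count_b == 0:
        return aggregate
    # Statistics of the new batch alone, computed in vectorized form.
    mean_b = pixels.mean(axis=0)
    m2_b = ((pixels - mean_b) ** 2).sum(axis=0)
    # Combine the two partial aggregates (parallel variance formula).
    count = count_a + count_b
    delta = mean_b - mean_a
    mean = mean_a + delta * count_b / count
    m2 = m2_a + m2_b + delta ** 2 * count_a * count_b / count
    return count, mean, m2


def channel_stats(images):
    """Return per-channel (mean, sample std) over an iterable of images."""
    aggregate = None
    for image in images:
        image = np.asarray(image)
        # Flatten H x W x C into a list of pixels, one row per pixel.
        pixels = image.reshape(-1, image.shape[-1])
        if aggregate is None:
            channels = pixels.shape[1]
            aggregate = (0, np.zeros(channels), np.zeros(channels))
        aggregate = merge_batch(aggregate, pixels)
    if aggregate is None or aggregate[0] < 2:
        raise ValueError("need at least two pixels")
    count, mean, m2 = aggregate
    return mean, np.sqrt(m2 / (count - 1))
```

With OpenCV this could be called as `channel_stats(cv2.imread(f) for f in files)`, where `files` is a hypothetical list of paths. Only one image is held in memory at a time, the results come back in BGR order, and a real loop would need to skip files for which `cv2.imread` returns `None`.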