
I am trying to generate a lot of data with numpy, without success. To be clear, it is not the library's fault - I just do not have enough memory in my computer (~32 GB of RAM).

Is there a better way to do this, such as saving to disk as the data is generated rather than holding it all in memory?

Here is the working code:

import numpy as np
import tensorflow as tf


def main():
    
    (images, labels), (_, _) = tf.keras.datasets.mnist.load_data(path="mnist.npz")
    # Flatten each 28x28 image into a vector of 784 pixels
    rescaled_images = np.reshape(images, (images.shape[0], images.shape[1] * images.shape[2]))

    indices = np.where(labels==0)
    zero_images = np.take(rescaled_images, indices, axis=0)[0]
    x0 = np.array(zero_images).astype(np.float32)

    length = zero_images.shape[1]
    ones = np.ones(length)
    
    steps = 1000000 # <--- problem even with one step
    mu = -2.0
    sigma = 1.5
    dt = 0.01

    data = []
    for image in x0:
        diff_imgs = [image]
        x = image
        for i in range(steps):
            x = x + mu * ones * dt + sigma * ones * np.sqrt(dt) * np.random.randn(length)
            diff_imgs.append(x)
        data.append(diff_imgs)

    data = np.array(data, dtype=np.float32)

if __name__ == '__main__':
    main()

1 Answer


Saving data to disk can give you some more headroom. But when you don't need all the data simultaneously, or a saved history of everything generated so far, you can get away with just creating a recipe for how to create the data.
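
If you do want every step saved, one possibility (not part of this answer's approach) is to write into a memory-mapped .npy file, so the history lives on disk and only the pages being written stay in RAM. A minimal sketch, with placeholder sizes and random stand-in images:

import numpy as np

steps = 1000                      # small numbers, just for the sketch
n_images, length = 100, 784
mu, sigma, dt = -2.0, 1.5, 0.01

# Memory-mapped array: stored on disk, indexed like a normal numpy array
data = np.lib.format.open_memmap(
    "diffusion_data.npy", mode="w+",
    dtype=np.float32, shape=(n_images, steps + 1, length))

x0 = np.random.rand(n_images, length).astype(np.float32)  # stand-in for zero_images
for n, image in enumerate(x0):
    x = image
    data[n, 0] = x
    for i in range(steps):
        # Same update rule as in the question, written straight to disk
        x = x + mu * dt + sigma * np.sqrt(dt) * np.random.randn(length)
        data[n, i + 1] = x
data.flush()  # make sure everything is written out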

The recipe idea can be achieved by creating a generator, using yield instead of return (for more information on generators see this thread). You create the recipe for the logic once, and then just call next() on the generator object every time you need more data.

An example based on your case:

import numpy as np
import tensorflow as tf

# Constants:
MU = -2.0
SIGMA = 1.5
DT = 0.01


def img_add_noise(image):
    # One diffusion step: drift MU*DT plus Gaussian noise scaled by SIGMA*sqrt(DT)
    return image + MU * DT + SIGMA * np.sqrt(DT) * np.random.randn(len(image))

def data_add_noise(images, steps):
    for image in images:
        # Build the trajectory for one image, then hand it out lazily
        batch_data = [image]
        for i in range(steps):
            batch_data.append(img_add_noise(batch_data[-1]))
        yield np.stack(batch_data).astype(np.float32)


def main():
    (images, labels), (_, _) = tf.keras.datasets.mnist.load_data(path="mnist.npz")
    rescaled_images = images.reshape(len(images), -1)
    indices = np.where(labels == 0)  # same zero-digit selection as in the question
    zero_images = np.take(rescaled_images, indices, axis=0)[0]
    data = data_add_noise(zero_images, 10)
    sample_data = next(data)
    print(sample_data.shape)

if __name__ == '__main__':
    main()

This way you can also make a function that generates an infinite amount of data, by changing the for image in images loop in the data_add_noise function to:

counter = 0
while True:
    image = images[counter%len(images)]
    counter += 1
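
Put together, the infinite version could look something like this (a sketch along the lines of the snippet above; it reuses img_add_noise and the counter-based cycling is just one way to do it):

def data_add_noise_infinite(images, steps):
    # Cycle through the images forever, yielding a fresh noisy trajectory each time
    counter = 0
    while True:
        image = images[counter % len(images)]
        counter += 1
        batch_data = [image]
        for i in range(steps):
            batch_data.append(img_add_noise(batch_data[-1]))
        yield np.stack(batch_data).astype(np.float32)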

If the generated data is being used for training TensorFlow models, have a look here for how to use a Sequence to achieve that.
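
As a rough illustration of that approach, here is a sketch of a Keras Sequence wrapping the same noise logic (the class name, the batch size, and the choice of returning the clean image as the target are my own assumptions, not part of the answer; it assumes the imports and img_add_noise from the example above):

class NoisyDigitSequence(tf.keras.utils.Sequence):
    # Hypothetical wrapper: serves (noisy_image, clean_image) batches
    def __init__(self, images, batch_size=32):
        self.images = images.astype(np.float32)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.images[idx * self.batch_size:(idx + 1) * self.batch_size]
        noisy = np.stack([img_add_noise(image) for image in batch])
        return noisy, batch

An instance of such a class can be passed directly to model.fit, so batches are generated on the fly instead of precomputed in memory.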
