
I am working on a large dataset of images with shape (10000000, 1, 32, 32), in (instances, channel, height, width) format. I was able to load the data and split it into chunks, but my concern now is how to train my CNN model on these chunks.

import dask.array as da

# f is an h5py.File opened on the HDF5 file that holds the 'features' and 'targets' datasets
X = da.from_array(f['features'], chunks=(1000, 1, 32, 32))
y = da.from_array(f['targets'], chunks=(1000, 1))
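
For context, the chunking above is lazy: Dask does not read anything from disk until a chunk is actually needed. A small sketch of inspecting and materializing one chunk, reusing the X defined above (the printed values are illustrative):

# Chunk sizes per axis and number of blocks per axis
print(X.chunks)      # e.g. ((1000, 1000, ...), (1,), (32,), (32,))
print(X.numblocks)   # e.g. (10000, 1, 1, 1)

# Slicing is lazy; .compute() reads only the chunks needed for that slice
first_chunk = X[:1000].compute()   # a (1000, 1, 32, 32) NumPy array in memory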

Below is a snapshot of what the data looks like now: [output of X and y after applying chunking]. I only have 6 GB of GPU memory, which is why I wanted to use Dask to chunk the data. However, when I run my CNN:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, min_lr=1e-6)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
callbacks = [lr_scheduler, early_stopping_cb]

inputs1 = keras.Input(shape=(1, 32, 32), batch_size=64)
x1 = Conv2D(32, kernel_size=2, padding='same', strides=1, activation='relu',
            kernel_initializer='TruncatedNormal', data_format='channels_first')(inputs1)
x1 = Conv2D(32, kernel_size=2, padding='same', strides=1, activation='relu',
            kernel_initializer='TruncatedNormal', data_format='channels_first')(x1)
x1 = MaxPooling2D(pool_size=(2, 2), data_format='channels_first')(x1)
x1 = Conv2D(64, kernel_size=3, padding='same', strides=1, activation='relu',
            kernel_initializer='TruncatedNormal', data_format='channels_first')(x1)
x1 = Conv2D(64, kernel_size=3, padding='same', strides=1, activation='relu',
            kernel_initializer='TruncatedNormal', data_format='channels_first')(x1)
x1 = Flatten()(x1)
x1 = Dense(128, activation='relu', kernel_initializer='TruncatedNormal')(x1)
x1 = Dropout(0.2)(x1)
x1 = Dense(32, activation='relu', kernel_initializer='TruncatedNormal')(x1)
x1 = Dropout(0.2)(x1)
outputs1 = Dense(1, activation='sigmoid', kernel_initializer='TruncatedNormal')(x1)
conv2d = keras.Model(inputs=inputs1, outputs=outputs1)

conv2d.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
conv2d.fit(X, y, batch_size=64, epochs=30, callbacks=callbacks)

it returns a warning that says "WARNING:tensorflow:Keras is training/fitting/evaluating on array-like data. Keras may not be optimized for this format, so if your input data format is supported by TensorFlow I/O (https://github.com/tensorflow/io) we recommend using that to load a Dataset instead." and the training time comes out to around 30 hours for one epoch. I know something is wrong here, but how should I deal with it? I have read some documentation online but found it confusing, as I am new to the Dask framework. Any help will be appreciated!
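
One possible direction, sketched below with an assumed file name (images.h5), dataset names, and a helper hdf5_batches that are not from my actual code: instead of passing the Dask array to fit(), which is what triggers the array-like data warning, wrap the chunked HDF5 data in a tf.data.Dataset built from a generator, so Keras consumes batches streamed from disk rather than one giant array-like object.

import h5py
import numpy as np
import tensorflow as tf

CHUNK = 1000   # matches the Dask chunk size above; an assumption
BATCH = 64

def hdf5_batches(path="images.h5"):   # hypothetical file name
    # Yield (features, targets) one chunk at a time, read straight from disk
    with h5py.File(path, "r") as f:
        n = f["features"].shape[0]
        for start in range(0, n, CHUNK):
            stop = min(start + CHUNK, n)
            yield (f["features"][start:stop].astype(np.float32),
                   f["targets"][start:stop].astype(np.float32))

ds = tf.data.Dataset.from_generator(
    hdf5_batches,
    output_signature=(
        tf.TensorSpec(shape=(None, 1, 32, 32), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
    ),
)

# Re-batch the 1000-row chunks into model-sized batches and overlap I/O with training
ds = ds.unbatch().batch(BATCH).prefetch(tf.data.AUTOTUNE)

# conv2d.fit(ds, epochs=30, callbacks=callbacks)

A held-out slice of the file could be wrapped the same way and passed as validation_data, which the ReduceLROnPlateau and EarlyStopping callbacks above need in order to see val_loss.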

  • There is a TensorFlow I/O package which probably has a better in-memory data representation optimized for TensorFlow. It can apparently also load data in chunks, but for that your data has to be in TFRecord format. Maybe you can use Dask, Ray Datasets, or Polars to transform the data into this format first. – dre-hh Apr 18 '23 at 20:54
  • Thanks for the response! I do understand that it should be in TFRecord format, but when I looked up the documentation I was overwhelmed and could not figure out where to start. Apparently I need to create an input pipeline using tf.data. Can you perhaps provide pseudocode for this? (A rough sketch along those lines appears after these comments.) – Newbie Apr 19 '23 at 02:00
  • Sorry... this is not entirely my domain either; I would only be repeating what ChatGPT (GPT-4) gives me :)) But it can give you quite precise answers, which are easier to double-check against the docs. There is also Copilot X from GitHub, which can read the docs and use a ChatGPT-like model to answer precise questions. – dre-hh Apr 19 '23 at 11:36
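
A rough sketch of what the comments suggest, with assumed file names, feature keys, and shard layout (this is not a tested pipeline): write the data out chunk by chunk as TFRecord shards, then read them back with tf.data.

import h5py
import tensorflow as tf

CHUNK = 1000   # one shard per chunk; an assumption

def to_example(image, target):
    # Serialize one (1, 32, 32) float32 image and its scalar target
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "target": tf.train.Feature(float_list=tf.train.FloatList(value=[float(target)])),
    }))

with h5py.File("images.h5", "r") as f:   # hypothetical file name
    n = f["features"].shape[0]
    for shard, start in enumerate(range(0, n, CHUNK)):
        stop = min(start + CHUNK, n)
        images = f["features"][start:stop].astype("float32")
        targets = f["targets"][start:stop].astype("float32").ravel()
        with tf.io.TFRecordWriter(f"shard-{shard:05d}.tfrecord") as writer:
            for img, tgt in zip(images, targets):
                writer.write(to_example(img, tgt).SerializeToString())

# Reading the shards back into a Dataset:
def parse(record):
    parsed = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.float32),
    })
    image = tf.reshape(tf.io.decode_raw(parsed["image"], tf.float32), (1, 32, 32))
    return image, parsed["target"]

files = tf.data.Dataset.list_files("shard-*.tfrecord")
ds = tf.data.TFRecordDataset(files).map(parse, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(64).prefetch(tf.data.AUTOTUNE)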

0 Answers