Using KFold cross validation to get MAE for each data split

Question

I want to get the mean absolute error (MAE) for each split of the data using 5-fold cross-validation. I have built a custom model on top of Xception.

To try this, I wrote the following:

# Data Generators:
train_gen = flow_from_dataframe(core_idg, train_df, 
                                path_col = 'path',
                                y_col = 'boneage_zscore', 
                                target_size = IMG_SIZE,
                                color_mode = 'rgb',
                                batch_size = 32,
                                shuffle = True)

X_train, Y_train = next(train_gen)

#-----------------------------------------------------------------------
# Custom Model initiation:
    
base_model = Xception(input_shape = X_train.shape[1:], include_top = False, weights = 'imagenet')
base_model.trainable = True

model = Sequential()
model.add(base_model)
model.add(GlobalMaxPooling2D())
model.add(Flatten())

model.add(Dense(16, activation = 'relu'))
model.add(Dense(1, activation = 'linear'))

def mae_months(in_gt, in_pred):
    return mean_absolute_error(boneage_div * in_gt, boneage_div * in_pred) 

# Compile model
adam = Adam(learning_rate = 0.0005)
model.compile(loss = 'mse', optimizer = adam, metrics = [mae_months])

#-----------------------------------------------------------------------
# KFold
n_splits = 5
kf = KFold(n_splits = n_splits, shuffle = True, random_state = 42)

I got as far as setting up KFold, but I am stuck on the cross-validation step itself: how do I get the MAE for each data split?

A post here suggests a for loop over the KFold splits, but does that only apply when a model such as DecisionTreeRegressor() is used, rather than a custom model built on Xception like mine?

UPDATE

Following the suggestion below, I added this code after setting up KFold:

# Data Generators:
train_gen = flow_from_dataframe(core_idg, train_df, 
                                path_col = 'path',
                                y_col = 'boneage_zscore', 
                                target_size = IMG_SIZE,
                                color_mode = 'rgb',
                                batch_size = 1024,
                                shuffle = True)
...
...
...

mae_list = []
n_splits = 5
kf = KFold(n_splits = n_splits, shuffle = True, random_state = 42)
split = kf.split(X_train, Y_train) # X_train, Y_train = next(train_gen) from above

for train, test in split:
    x_train, x_test, y_train, y_test = X_train[train], X_train[test], Y_train[train], Y_train[test]
    history = model.fit(x_train, y_train, validation_data = (x_test, y_test), batch_size = 16)
    pred = model.predict(x_test, batch_size = 8)
    err = mean_absolute_error(y_test, pred)
    mae_list.append(err)

I first set the batch size of train_gen to 1024 and then ran the code above, but I get the following error:

52/52 [==============================] - 16s 200ms/step - loss: 0.9926 - mae_months: 31.5353 - val_loss: 4.4153 - val_mae_months: 81.5463
52/52 [==============================] - 9s 172ms/step - loss: 0.4185 - mae_months: 21.4242 - val_loss: 0.7401 - val_mae_months: 29.3815
52/52 [==============================] - 9s 172ms/step - loss: 0.2930 - mae_months: 17.3729 - val_loss: 0.5628 - val_mae_months: 23.9055
 9/52 [====>.........................] - ETA: 7s - loss: 0.2355 - mae_months: 16.7444

ResourceExhaustedError                    Traceback (most recent call last)
Input In [11], in <cell line: 9>()
     10 x_train, x_test, y_train, y_test = X_train[train], X_train[test], Y_train[train], Y_train[test]
     11 # model = boneage_model()
     12 # history = model.fit(train_gen, validation_data = (x_test, y_test))
---> 13 history = model.fit(x_train, y_train, validation_data = (x_test, y_test), batch_size = 16)
     14 pred = model.predict(x_test, batch_size = 8)
     15 err = mean_absolute_error(y_test, pred)

ResourceExhaustedError: Graph execution error:

....
....
....

Node: 'gradient_tape/sequential/xception/block14_sepconv2/separable_conv2d/Conv2DBackpropFilter'
OOM when allocating tensor with shape[2048,1536,1,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradient_tape/sequential/xception/block14_sepconv2/separable_conv2d/Conv2DBackpropFilter}}]]

The memory allocation reported in the console looks like this (hopefully it makes sense):

total_region_allocated_bytes_: 5769199616 
memory_limit_: 5769199616 
available bytes: 0 
curr_region_allocation_bytes_: 8589934592

Stats:
Limit:                      5769199616
InUse:                      5762760448
MaxInUse:                   5769190400
NumAllocs:                      192519
MaxAllocSize:               2470510592
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

Is this because my GPU cannot handle the batch_size?
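As a rough sanity check on the progress bars, the fold sizes and steps per epoch can be worked out by hand. This is only a sketch under stated assumptions: that `next(train_gen)` returned a full batch (1024 images here, or 64 with a smaller generator batch), and that `fit()` on arrays uses the `batch_size` passed to it or Keras' default of 32 when none is given. The helper names `fold_sizes` and `steps_per_epoch` are hypothetical and not part of any library.

```python
import math


def fold_sizes(n_samples, n_splits):
    """Test-fold sizes as scikit-learn's KFold assigns them:
    the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


def steps_per_epoch(n_train, batch_size):
    """Number of batches per epoch when fitting on in-memory arrays."""
    return math.ceil(n_train / batch_size)


# (samples pulled from the generator, batch size used by fit())
results = {}
for n_samples, batch_size in [(1024, 16), (64, 32)]:
    tests = fold_sizes(n_samples, 5)
    steps = [steps_per_epoch(n_samples - t, batch_size) for t in tests]
    results[n_samples] = (tests, steps)
    print(n_samples, tests, steps)
# 1024 [205, 205, 205, 205, 204] [52, 52, 52, 52, 52]
# 64 [13, 13, 13, 13, 12] [2, 2, 2, 2, 2]
```

The 52 steps match the `52/52` progress bars above, which suggests that each training step processed 16 images rather than 1024. The generator's `batch_size` only decided how many images were loaded into `X_train` in the first place.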

UPDATE 2

I have decreased the batch_size of train_gen to 32 and removed batch_size from the fit() and predict() calls. Is this the right way to determine the MAE for each data split?

Code:

# Data Generators:
train_gen = flow_from_dataframe(core_idg, train_df, 
                                path_col = 'path',
                                y_col = 'boneage_zscore', 
                                target_size = IMG_SIZE,
                                color_mode = 'rgb',
                                batch_size = 32,
                                shuffle = True)

X_train, Y_train = next(train_gen)
...
...
...

mae_list = []
n_splits = 5
kf = KFold(n_splits = n_splits, shuffle = True, random_state = 42)
split = kf.split(X_train, Y_train) # X_train, Y_train = next(train_gen) from above

for train, test in split:
    x_train, x_test, y_train, y_test = X_train[train], X_train[test], Y_train[train], Y_train[test]
    history = model.fit(x_train, y_train, validation_data = (x_test, y_test))
    pred = model.predict(x_test)
    err = mean_absolute_error(y_test, pred)
    mae_list.append(err)

UPDATE 3

According to the suggestions from the comments:

Edited the batch_size of the train_gen to 64.
Added valid_gen to use X_valid and y_valid as validation data of the fit() method.
Used x_test for the predict() method.
Enabled GPU memory growth (set_memory_growth) so that TensorFlow allocates GPU memory on demand.

Code:


# Checking the GPU availability
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

...
...
...

# Data Generators:
train_gen = flow_from_dataframe(core_idg, train_df, 
                                path_col = 'path',
                                y_col = 'boneage_zscore', 
                                target_size = IMG_SIZE,
                                color_mode = 'rgb',
                                batch_size = 64,
                                shuffle = True)

X_train, Y_train = next(train_gen)

valid_gen = flow_from_dataframe(core_valid, valid_df, 
                                path_col = 'path',
                                y_col = 'boneage_zscore', 
                                target_size = IMG_SIZE,
                                color_mode = 'rgb',
                                batch_size = 64,
                                shuffle = True)

X_valid, y_valid = next(valid_gen)


# Getting MAE for each data split using 5-fold (KFold)

cv_mae = []
n_splits = 5
kf = KFold(n_splits = n_splits, shuffle = True, random_state = 42)
split = kf.split(X_train, Y_train)

for train, test in split:
    x_train, x_test, y_train, y_test = X_train[train], X_train[test], Y_train[train], Y_train[test]
    history = model.fit(x_train, y_train, validation_data = (X_valid, y_valid))
    pred = model.predict(x_test)
    err = mean_absolute_error(y_test, pred)
    cv_mae.append(err)

cv_mae

The output:

2/2 [==============================] - 8s 2s/step - loss: 3.6179 - mae_months: 66.8136 - val_loss: 2.1544 - val_mae_months: 47.2171
2/2 [==============================] - 1s 394ms/step - loss: 1.0826 - mae_months: 36.3370 - val_loss: 1.6431 - val_mae_months: 40.9770
2/2 [==============================] - 1s 344ms/step - loss: 0.6129 - mae_months: 23.0258 - val_loss: 1.8911 - val_mae_months: 45.6456
2/2 [==============================] - 1s 360ms/step - loss: 0.4500 - mae_months: 22.6450 - val_loss: 1.3592 - val_mae_months: 36.7073
2/2 [==============================] - 1s 1s/step - loss: 0.4222 - mae_months: 20.2543 - val_loss: 1.1010 - val_mae_months: 32.8488

[<tf.Tensor: shape=(13,), dtype=float32, numpy=
 array([1.4442804, 1.3981661, 1.5037801, 2.2199252, 1.7645894, 1.4836203,
        1.7916738, 1.3967942, 1.4069557, 2.516875 , 1.4077926, 1.4342965,
        1.9279695], dtype=float32)>,
 <tf.Tensor: shape=(13,), dtype=float32, numpy=
 array([1.8153722, 1.9236553, 1.3917867, 1.5313213, 1.387209 , 1.3831038,
        1.4519565, 1.4680854, 1.7810788, 2.5733376, 1.4269204, 1.3751   ,
        1.446231 ], dtype=float32)>,
 <tf.Tensor: shape=(13,), dtype=float32, numpy=
 array([1.6616   , 1.6529323, 1.9181525, 2.536807 , 1.6306267, 2.856683 ,
        2.113724 , 1.5543866, 1.9128528, 3.218016 , 1.4112593, 1.4043481,
        3.229338 ], dtype=float32)>,
 <tf.Tensor: shape=(13,), dtype=float32, numpy=
 array([2.1295295, 1.8527019, 1.9779519, 3.1390932, 1.5525225, 2.0811615,
        1.6279813, 1.87973  , 1.5029857, 1.6502519, 2.3677726, 1.8570358,
        1.7251074], dtype=float32)>,
 <tf.Tensor: shape=(12,), dtype=float32, numpy=
 array([1.3926607, 1.7088655, 1.7379242, 3.5756006, 1.5988973, 1.3926607,
        1.4928951, 1.4665956, 1.3926607, 1.4575896, 3.146022 , 1.3926607],
       dtype=float32)>]

Does this mean that I now have the MAEs for the 5 data splits (the numpy=array([...]) entries in the output)?
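One detail in this output is worth checking: each fold produced a tensor of shape (13,) or (12,) rather than a single number. The sketch below reproduces that shape with NumPy only. It assumes that `mean_absolute_error` in the loop is the Keras loss function (which averages over the last axis and returns a tensor, as the `tf.Tensor` results suggest) rather than scikit-learn's metric, and that `model.predict()` returns an array of shape (n, 1). The data here is made up purely to show the shapes.

```python
import numpy as np

# Shapes taken from one fold of the output above: 13 test samples.
y_test = np.linspace(-1.0, 1.0, 13)      # targets, shape (13,)
pred = (y_test + 0.5).reshape(-1, 1)     # stand-in predictions, shape (13, 1)

# Mean over the last axis, mimicking the Keras loss function.
# (13,) against (13, 1) first broadcasts to a (13, 13) matrix,
# so every target is compared with every prediction.
broadcast_err = np.mean(np.abs(y_test - pred), axis=-1)
print(broadcast_err.shape)               # (13,)

# Flattening the predictions first pairs each target with its own
# prediction and gives a single MAE for the fold.
fold_mae = float(np.mean(np.abs(y_test - pred.reshape(-1))))
print(fold_mae)
```

If that assumption holds, flattening the predictions before computing the error would give one scalar per fold, so `cv_mae` would contain five numbers. Those numbers would be in z-score units; `mae_months` in the question multiplies by `boneage_div` to report months.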

The model doesn't matter with K-Fold. For example, if you call `sklearn.model_selection.StratifiedKFold()`, its `split()` yields pairs of `train, test` index arrays that you use in a for loop to index into your main, larger dataset. Then train using those subsets as if they were your datasets. – Djinn, Aug 30 '22 at 15:02
So if I use the returned `train, test` arrays, do I use the for loop mentioned in that post to get the MAE for each data split? How do I do the cross-validation step so that I get the MAE for each data split? – Bathtub, Aug 31 '22 at 10:22
According to `UPDATE 2` you still haven't lowered the batch size. It's still 1024 from `flow_from_dataframe()`. And although the structure of your code is fine, the logic could be better. You're using your test set as your validation set too. You should probably pull your validation data from your training data using the `subset` parameter within `flow_from_dataframe()`. Use validation data with `fit()` and test data with `predict()`, although for simple testing just to see if everything works, you're fine as is. – Djinn, Sep 01 '22 at 16:29
Update 3 has the fixes. I also removed the `batch_size` parameter from the `fit()` and `predict()` methods. Please have a look at the update on the post and let me know if everything is good. – Bathtub, Sep 02 '22 at 02:39
Thank you. But also, is that the right way to obtain the MAE for each data split? – Bathtub, Sep 02 '22 at 09:34

Djinn · Answer 1 · 2022-08-31T16:34:53.023

Ideally, you'd take the train and test sets together from the same k-fold split, but it doesn't matter if you use the same seed. A k-fold split just returns indices that select the train and test elements, so you need to apply the indices from the split to the original dataset.

Answer based on the OP's comment and question:

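from sklearn.model_selection import StratifiedKFold as kfold
# Note: StratifiedKFold expects class labels; for a continuous target, import KFold instead.

n_splits = 5
x, y = images, labels  # placeholders: your image array and label array
cvscores = []
kf = kfold(n_splits = n_splits, shuffle = True, random_state = 42)
split = kf.split(x, y)

for train, test in split:
    x_train, x_test, y_train, y_test = x[train], x[test], y[train], y[test]
    model = ...  # do model stuff (build and compile the model here)
    _ = model.fit(x_train, y_train)
    result = model.evaluate(x_test, y_test)
    # depending on how you want to handle the results
    cvscores.append(result)
# do stuff with cvscores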

I'm not sure whether that would work with an object from `flow_from_dataframe()`, because that wouldn't be an array or array-like, although you should be able to get the arrays within.
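The loop above can be sketched end to end as follows. This is a minimal illustration, not the definitive way to do it: `MeanRegressor` is a hypothetical stand-in for the Keras model so that the example runs without TensorFlow, `kfold_mae` and `build_model` are made-up names, and the random arrays are placeholders for the images and `boneage_zscore` targets. It uses plain `KFold` because the target in the question is continuous. The point it shows is that a new model is built inside the loop, so no fold starts from weights already trained on its own test samples.

```python
import numpy as np
from sklearn.model_selection import KFold


class MeanRegressor:
    """Hypothetical stand-in for the Keras model: predicts the training mean."""

    def fit(self, x, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, x):
        return np.full(len(x), self.mean_)


def kfold_mae(x, y, build_model, n_splits=5, seed=42):
    """Return one MAE per fold, building a fresh model for every fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mae = []
    for train_idx, test_idx in kf.split(x):
        model = build_model()  # new, untrained model for this fold
        model.fit(x[train_idx], y[train_idx])
        # Flatten so predictions line up one-to-one with the targets.
        pred = np.asarray(model.predict(x[test_idx])).reshape(-1)
        fold_mae.append(float(np.mean(np.abs(y[test_idx] - pred))))
    return fold_mae


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))  # placeholder features instead of images
y = rng.normal(size=64)       # placeholder targets instead of boneage_zscore
scores = kfold_mae(x, y, MeanRegressor)
print(scores, np.mean(scores))
```

With a Keras model, `build_model` would be a function that constructs and compiles the Xception-based network (the commented-out `boneage_model()` in the question's traceback looks like such a function), and the per-fold scores could then be averaged or reported individually.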

edited Aug 31 '22 at 16:34

answered Aug 31 '22 at 16:22

Thank you for your suggestion. How would you implement this with my custom model that uses Xception? – Bathtub Aug 31 '22 at 20:31
The exact same way you would as if you didn't do cross validation. – Djinn Aug 31 '22 at 23:04
Please see the updated post above :) – Bathtub Aug 31 '22 at 23:31
`batch_size=1024` is probably way too high. Why not use something like 32 or 64? You should also maybe [limit gpu memory growth.](https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory) Also why are you setting `batch_size` in the `fit()` and `predict()` methods? – Djinn Sep 01 '22 at 01:19
When you mentioned "limit GPU memory growth", should I use per_process_gpu_memory_fraction=0.333 from the post? So like, instead of 0.333, something like 0.75? I have 8GB GPU VRAM. For `batch_size`, where should I set it to, if not inside `fit()` and `predict()` methods? – Bathtub Sep 01 '22 at 12:23
"should I use per_process_gpu_memory_fraction=0.333 from the post" No, [take the third answer](https://stackoverflow.com/a/55541385/1676589) with the heading: `For TensorFlow 2.2+ (docs)`. "For batch_size, where should I set it to" You set the batch size with the `batch_size` parameter within `flow_from_dataframe()`. – Djinn Sep 01 '22 at 16:18
You set `batch_size` in `fit()` and `predict()` if your dataset doesn't have a built-in batch size, typically objects not derived from `Dataset` or similar, like if you passed arrays or dataframes to those methods. Or maybe if you just left out `batch_size`, you'll need to set them later in those methods. [According to the documentation on `ImageDataGenerator`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator), all flow* methods have a default batch_size of 32, so there's no need to set a batch_size in `fit()` and `predict()`. – Djinn Sep 01 '22 at 16:20
Thank you for the pointer on how to limit GPU memory growth. I have made the fixes you suggested. Please have a look at the update on the post – Bathtub Sep 02 '22 at 02:38
