
I stopped training at some point and saved the checkpoint, meta files, etc. Now, when I resume training, I want to start with the optimizer's last running learning rate. Can you provide an example of doing so?

Rajarshee Mitra
  • Do you use basic TensorFlow or some high-level abstraction like scikit or tflearn? For the basic case, see https://stackoverflow.com/questions/33759623/tensorflow-how-to-save-restore-a-model – S. Stas Jul 03 '17 at 13:26
  • Does it show how to get the last running learning rate? – Rajarshee Mitra Jul 03 '17 at 13:28
  • You can treat the learning rate as a regular TensorFlow variable, so you can set it, save it and load it like any other variable (see the sketch after this comment thread). Technically, among the checkpoint files you can find the meta chkp file containing the model's protobufs, including metadata such as the learning rate, though I've never tried to use it. – S. Stas Jul 03 '17 at 13:45
  • Yeah, but I need an example of extracting the learning rate from the meta file. – Rajarshee Mitra Jul 03 '17 at 14:09
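For reference, a minimal sketch of the idea from the comments above, i.e. keeping the learning rate in a regular TensorFlow variable so it is checkpointed alongside the weights (the variable name and initial value are illustrative):

import tensorflow as tf

# Keep the learning rate in a non-trainable variable so tf.train.Saver
# checkpoints it together with the model weights.
lr = tf.Variable(0.15, trainable=False, name='learning_rate')
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
# ... build the rest of the graph ...
saver = tf.train.Saver()  # includes 'learning_rate' by default

# After saver.restore(sess, checkpoint_path), sess.run(lr) returns the saved
# value, and sess.run(lr.assign(new_value)) changes it manually if needed.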

2 Answers


For those coming here (like me) wondering whether the last learning rate is automatically restored: tf.train.exponential_decay doesn't add any Variables to the graph; it only adds the operations needed to derive the current learning rate from a given global_step value. This means you only need to checkpoint the global_step value (which is normally done by default) and, assuming you keep the same initial learning rate, decay steps and decay factor, you'll automatically pick up training where you left off, with the correct learning rate.

Inspecting the checkpoint won't show any learning_rate variable (or similar), simply because there is no need for any.
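As a concrete illustration, here is a minimal TF 1.x sketch (not part of the original answer; the checkpoint directory, initial rate and decay settings are placeholders and must match what you trained with, and it assumes the global step was created with tf.train.get_or_create_global_step or an equivalently named variable):

import tensorflow as tf

ckpt_path = tf.train.latest_checkpoint('./my_ckpt_dir')  # placeholder directory

# Listing the checkpointed variables shows a global step but no learning rate:
print(tf.train.list_variables(ckpt_path))

# Rebuild the same schedule; the learning rate is an op derived from global_step.
global_step = tf.train.get_or_create_global_step()
lr = tf.train.exponential_decay(0.15, global_step, decay_steps=10, decay_rate=0.96)
# ... rebuild the rest of the model graph here ...

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, ckpt_path)      # restores global_step (among the rest)
    print(sess.run([global_step, lr]))  # lr picks up exactly where it left off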

GPhilo

This example code learns to add two numbers:

import tensorflow as tf
import numpy as np
import os


save_ckpt_dir = './add_ckpt'
ckpt_filename = 'add.ckpt'

save_ckpt_path = os.path.join(save_ckpt_dir, ckpt_filename)

if not os.path.isdir(save_ckpt_dir):
    os.mkdir(save_ckpt_dir)

if any(fname.startswith("add.ckpt") for fname in os.listdir(save_ckpt_dir)):  # prefer to load the pre-trained net if a checkpoint exists
    load_ckpt_path = save_ckpt_path
else:
    load_ckpt_path = None  # train from scratch


def add_layer(inputs, in_size, out_size, activation_fn=None):
    """Builds a fully connected layer (weights initialised to ones, biases to zeros)."""
    Weights = tf.Variable(tf.ones([in_size, out_size]), name='Weights')
    biases = tf.Variable(tf.zeros([1, out_size]), name='biases')
    Wx_plus_b = tf.add(tf.matmul(inputs, Weights), biases)
    if activation_fn is None:
        layer_output = Wx_plus_b
    else:
        layer_output = activation_fn(Wx_plus_b)
    return layer_output


def produce_batch(batch_size=256):
    """Loads a single batch of data.

    Args:
      batch_size: The number of examples in the batch.

    Returns:
      x : column vector of numbers
      y : another column of numbers
      xy_sum : the sum of the columns
    """
    x = np.random.random(size=[batch_size, 1]) * 10
    y = np.random.random(size=[batch_size, 1]) * 10
    xy_sum = x + y
    return x, y, xy_sum


with tf.name_scope("inputs"):
    xs = tf.placeholder(tf.float32, [None, 1])
    ys = tf.placeholder(tf.float32, [None, 1])

with tf.name_scope("correct_labels"):
    xysums = tf.placeholder(tf.float32, [None, 1])

with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # lr = 0.15 * 0.96^(global_step / 10): continuous decay with base 0.96 per 10 steps

with tf.name_scope("graph_body"):
    prediction = add_layer(tf.concat([xs, ys], 1), 2, 1, activation_fn=None)

with tf.name_scope("loss_and_train"):
    # the error between prediction and real data
    loss = tf.reduce_mean(tf.reduce_sum(tf.square(xysums - prediction), axis=[1]))

    # Passing global_step to minimize() will increment it at each step.
    train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)


with tf.name_scope("init_load_save"):
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    if load_ckpt_path:
        saver.restore(sess, load_ckpt_path)
    for i in range(1000):
        x, y, xy_sum = produce_batch(256)
        _, global_step_np, loss_np, lr_np = sess.run([train_step, global_step, loss, lr], feed_dict={xs: x, ys: y, xysums: xy_sum})
        if global_step_np % 100 == 0:
            print("global step: {}, loss: {}, learning rate: {}".format(global_step_np, loss_np, lr_np))

    saver.save(sess, save_ckpt_path)

If you run it a few times, you will see the learning rate decrease. It also saves the global step. The trick is here:

with tf.name_scope("step_and_learning_rate"):
    global_step = tf.Variable(0, trainable=False)
    lr = tf.train.exponential_decay(0.15, global_step, 10, 0.96)  # lr = 0.15 * 0.96^(global_step / 10): continuous decay with base 0.96 per 10 steps
...
train_step = tf.train.AdamOptimizer(lr).minimize(loss, global_step=global_step)

By default, saver.save will save all saveable objects, which here includes the global step (the learning rate itself is not a variable; it is re-derived from the restored global step, as the other answer explains). However, if tf.train.Saver is constructed with a var_list, saver.save will only save the variables included in that list:

saver = tf.train.Saver(var_list=...)  # list of vars to save
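For example (a hypothetical snippet, not from the original answer), if you restrict the Saver you must include global_step yourself, otherwise the learning-rate schedule restarts from step 0 after a restore:

# Hypothetical: restrict the Saver to the trainable variables, but keep
# global_step so the learning-rate schedule resumes where it left off.
saver = tf.train.Saver(var_list=tf.trainable_variables() + [global_step])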

Sources:

  • https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay
  • https://stats.stackexchange.com/questions/200063/tensorflow-adam-optimizer-with-exponential-decay
  • https://www.tensorflow.org/api_docs/python/tf/train/Saver (see "saveable objects")

Stefan Falk