
I'm fairly new to reinforcement learning and I've built an agent that feeds two inputs to its neural network: the first input is a tuple of two numbers representing the agent's current position, and the second input is an array of numbers ranging from 0 to 3 representing the types of requests the agent receives from the environment. The network outputs which movement is best (move forwards, backwards, sideways, etc.).

Each episode has 300 steps, and the for loop inside train_pos_nn() takes over 5 s (each call to predict() takes about 20 ms and each call to fit() about 7 ms), which amounts to more than 25 minutes per episode. That is too much time: about 17 days to finish the 1000 episodes required to converge. It takes the same amount of time on Google Colab (Edit: even when using the GPU option; a GPU cannot be set up on my local machine).

Is there any way I can reduce the amount of time it takes the agent to train?

# Imports needed by the code below; REPLAY_MEMORY_SIZE, MIN_REPLAY_MEMORY_SIZE,
# DISCOUNT, n_episodes, env, agent_dqn and normalize_pos_state are defined elsewhere.
from collections import deque
import random
import time

import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm

n_possible_movements = 9
MINIBATCH_SIZE = 32

class DQNAgent(object):
    def __init__(self):
        #self.gamma = 0.95 
        self.epsilon = 1.0
        self.epsilon_decay = 0.8
        self.epsilon_min = 0.1
        self.learning_rate = 10e-4 
        self.tau = 1e-3
                        
        # Main models
        self.model_uav_pos = self._build_pos_model()

        # Target networks
        self.target_model_uav_pos = self._build_pos_model()
        # Copy weights
        self.target_model_uav_pos.set_weights(self.model_uav_pos.get_weights())

        # An array with last n steps for training
        self.replay_memory_pos_nn = deque(maxlen=REPLAY_MEMORY_SIZE)
        
    def _build_pos_model(self): # compile the DNN
        # create the DNN model
        dnn = self.create_pos_dnn()
        
        opt = Adam(learning_rate=self.learning_rate) #, decay=self.epsilon_decay)
        dnn.compile(loss="categorical_crossentropy", optimizer=opt, metrics=['accuracy'])
        
        return dnn
    
    def create_pos_dnn(self): 
        # initialize the input shape (The shape of an array is the number of elements in each dimension)
        pos_input_shape = (2,)
        requests_input_shape = (len(env.ues),)
        # How many possible outputs we can have
        output_nodes = n_possible_movements
        
        # Initialize the inputs
        uav_current_position = Input(shape=pos_input_shape, name='pos')
        ues_requests = Input(shape=requests_input_shape, name='requests')
        
        # Put them in a list
        list_inputs = [uav_current_position, ues_requests]
        
        # Merge all input features into a single large vector
        x = layers.concatenate(list_inputs)
        
        # Add a 1st Hidden (Dense) Layer
        dense_layer_1 = Dense(512, activation="relu")(x)
        
        # Add a 2nd Hidden (Dense) Layer
        dense_layer_2 = Dense(512, activation="relu")(dense_layer_1)
        
        # Add a 3rd Hidden (Dense) Layer
        dense_layer_3 = Dense(256, activation="relu")(dense_layer_2)
        
        # Output layer
        output_layer = Dense(output_nodes, activation="softmax")(dense_layer_3)

        model = Model(inputs=list_inputs, outputs=output_layer)
                        
        # return the DNN
        return model
    
    def remember_pos_nn(self, state, action, reward, next_state, done):
        self.replay_memory_pos_nn.append((state, action, reward, next_state, done)) 
        
    def act_upon_choosing_a_new_position(self, state): # state is a tuple (uav_position, requests_array)
        if np.random.rand() <= self.epsilon: # if acting randomly, take random action
            return random.randrange(n_possible_movements)
        pos =  np.array([state[0]])
        reqs =  np.array([state[1]])
        act_values = self.model_uav_pos.predict(x=[pos, reqs]) # if not acting randomly, predict reward value based on current state
        return np.argmax(act_values[0]) 
        
    def train_pos_nn(self):
        print("In Training..")

        # Start training only if certain number of samples is already saved
        if len(self.replay_memory_pos_nn) < MIN_REPLAY_MEMORY_SIZE:
            print("Exiting Training: Replay Memory Not Full Enough...")
            return

        # Get a minibatch of random samples from memory replay table
        minibatch = random.sample(self.replay_memory_pos_nn, MINIBATCH_SIZE)

        start_time = time.time()
        # Enumerate our batches
        for index, (current_state, action, reward, new_current_state, done) in enumerate(minibatch):
            print('...Starting Training...')
            target = 0
            pos =  np.array([current_state[0]])
            reqs =  np.array([current_state[1]])
            pos_next = np.array([new_current_state[0]])
            reqs_next = np.array([new_current_state[1]])
    
            if not done:
                target = reward + DISCOUNT * np.amax(self.target_model_uav_pos.predict(x=[pos_next, reqs_next]))
            else:
                target = reward

            # Update Q value for given state
            target_f = self.model_uav_pos.predict(x=[pos, reqs])
            target_f[0][action] = target

            self.model_uav_pos.fit([pos, reqs], \
                                   target_f, \
                                   verbose=2, \
                                   shuffle=False, \
                                   callbacks=None, \
                                   epochs=1 \
                                  )  
        end_time = time.time()
        print("Time", end_time - start_time)
        # Update target network counter every episode
        self.target_train()
        
    def target_train(self):
        weights = self.model_uav_pos.get_weights()
        target_weights = self.target_model_uav_pos.get_weights()
        for i in range(len(target_weights)):
            target_weights[i] = weights[i] * self.tau + target_weights[i] * (1 - self.tau)
        self.target_model_uav_pos.set_weights(target_weights)
# Main 
SIZE = 100 # size of the grid the agent is in
for episode in tqdm(range(1, n_episodes + 1), ascii=True, unit='episodes'):  
    # Reset environment and get initial state
    current_state = env.reset(SIZE)

    # Reset flag and start iterating until episode ends
    done = False
    steps_n = 300

    for t in range(steps_n): 
        # Normalize the input (the current state)
        current_state_normalized = normalize_pos_state(current_state)
        
        # Get new position for the agent
        action_pos = agent_dqn.act_upon_choosing_a_new_position(current_state_normalized)
        
        new_state, reward, done, _ = env.step(action_pos)
        
        agent_dqn.remember_pos_nn(current_state_normalized, action_pos, reward, normalize_pos_state(new_state), done)

        current_state = new_state # not normalized
        
        agent_dqn.train_pos_nn()

    # Decay epsilon
    if episode % 50 == 0:
        if agent_dqn.epsilon > agent_dqn.epsilon_min:
            agent_dqn.epsilon *= agent_dqn.epsilon_decay
            agent_dqn.epsilon = max(agent_dqn.epsilon, agent_dqn.epsilon_min)

Ness

2 Answers


One performance optimization in your training loop is to use the call method of the model instead of predict, wrapped with tf.function. predict is good for batch inference, but it has some overhead, and for single samples call will likely be faster. Some more details about this difference can be found here. For your purposes, it could be modified like this:

class DQNAgent(object):

    def _build_pos_model(self): # compile the DNN
        # create the DNN model
        dnn = self.create_pos_dnn()
        
        opt = Adam(learning_rate=self.learning_rate) #, decay=self.epsilon_decay)
        dnn.compile(loss="categorical_crossentropy", optimizer=opt, metrics=['accuracy'])
        dnn.call = tf.function(dnn.call)
        
        return dnn

Then change every call of self.model_uav_pos.predict(..) and self.target_model_uav_pos.predict(...) to self.model_uav_pos(...) and self.target_model_uav_pos(...), respectively.
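For example, act_upon_choosing_a_new_position would then look like this (a sketch of the question's method with only that substitution; the direct call returns a tensor, which np.argmax handles fine):

    def act_upon_choosing_a_new_position(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(n_possible_movements)
        pos = np.array([state[0]])
        reqs = np.array([state[1]])
        # direct call instead of predict(); returns a tensor of shape (1, n_possible_movements)
        act_values = self.model_uav_pos([pos, reqs])
        return np.argmax(act_values[0])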

A further potential optimization is to JIT compile the TF function by supplying jit_compile=True to the tf.function wrapper, e.g.:

dnn.call = tf.function(dnn.call, jit_compile=True)
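Note that with tf.function (especially with jit_compile=True) the first call pays a one-time tracing/compilation cost, so it's the steady-state per-call time that improves. A quick way to see this, using the question's agent_dqn and env objects (names assumed from the question):

import time
import numpy as np

pos = np.zeros((1, 2), dtype=np.float32)
reqs = np.zeros((1, len(env.ues)), dtype=np.float32)

_ = agent_dqn.model_uav_pos([pos, reqs])   # first call: triggers tracing / XLA compilation
start = time.time()
_ = agent_dqn.model_uav_pos([pos, reqs])   # later calls: run the compiled graph
print("steady-state call time:", time.time() - start)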

Update

It looks like using the call method instead of predict, wrapping the call method in tf.function, and using JIT compilation improved performance over 2x (5 s -> 2 s), which is an appreciable difference. For further optimizations, although I don't think they will bring the time down much more, rather than wrapping only call, the computations that follow it could be wrapped in tf.function as well, so they all become one callable TensorFlow graph. For example:

        act_values = self.model_uav_pos([pos, reqs])
        return np.argmax(act_values[0]) 

Rather than calling np.argmax after call, we could use tf.argmax and wrap both in a tf.function. The revised implementation could be:

class DQNAgent(object):
    def __init__(self):
        #self.gamma = 0.95 
        self.epsilon = 1.0
        self.epsilon_decay = 0.8
        self.epsilon_min = 0.1
        self.learning_rate = 10e-4 
        self.tau = 1e-3
                        
        # Main models
        self.model_uav_pos = self._build_pos_model()
        self.pred_model_uav = tf.function(lambda x: tf.argmax(self.model_uav_pos(x), axis=1), jit_compile=True)

        # Target networks
        self.target_model_uav_pos = self._build_pos_model()
        # Copy weights
        self.target_model_uav_pos.set_weights(self.model_uav_pos.get_weights())
        self.pred_target_model_uav = tf.function(lambda x: tf.reduce_max(self.target_model_uav_pos(x)), jit_compile=True)

Then replace every call from the originally proposed solution with the corresponding new prediction function (e.g., instead of self.model_uav_pos(...), call self.pred_model_uav(...)), and remove the numpy function calls after the predictions. Note that in this implementation, dnn.call = tf.function(dnn.call) is removed from _build_pos_model, as we're now wrapping later.
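Concretely, with the definitions above, the two prediction sites become something like this (a sketch; pred_model_uav already applies the argmax over the action axis and pred_target_model_uav already applies the max):

        # in act_upon_choosing_a_new_position:
        action_index = self.pred_model_uav([pos, reqs])      # tensor of shape (1,)
        return int(action_index.numpy()[0])

        # in train_pos_nn, inside the minibatch loop:
        if not done:
            max_future_q = self.pred_target_model_uav([pos_next, reqs_next])  # scalar tensor
            target = reward + DISCOUNT * float(max_future_q.numpy())
        else:
            target = reward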

The benefit of this approach is that by JIT compiling the other computations (argmax and max) that are applied to the result, additional optimizations can potentially be made to the graph by fusing operations. Some additional details about this idea, along with a simple softmax example, can be found here.

As I said, I don't think this will result in a drastic further improvement, but it may shave off some additional time in the loop.

Update 2

I will revise my suggestion from the previous update, as I realized that calling model_uav_pos for inference occurs in two places: once in act_upon_choosing_a_new_position, where it's followed by the argmax, and once in train_pos_nn, where the raw output is used. I would suggest wrapping the call method of model_uav_pos with tf.function after defining self.pred_model_uav, so both inference paths are compiled into TensorFlow graphs:

class DQNAgent(object):
    def __init__(self):
        #self.gamma = 0.95 
        self.epsilon = 1.0
        self.epsilon_decay = 0.8
        self.epsilon_min = 0.1
        self.learning_rate = 10e-4 
        self.tau = 1e-3
                        
        # Main models
        self.model_uav_pos = self._build_pos_model()
        self.pred_model_uav = tf.function(lambda x: tf.argmax(self.model_uav_pos(x), axis=1), jit_compile=True)
        self.model_uav_pos.call = tf.function(self.model_uav_pos.call, jit_compile=True)

...

And in the act_upon_choosing_a_new_position method, self.pred_model_uav is used, and in the train_pos_nn method, just call self.model_uav_pos as was detailed in the original solution.
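In train_pos_nn, the target_f assignment then looks roughly like this (a sketch; the np.array conversion is needed because the tensor returned by call does not support item assignment):

            target_f = np.array(self.model_uav_pos([pos, reqs], training=False))
            target_f[0][action] = target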

danielcahall
  • Thank you for your answer :) the training loop now takes about +2s instead of +5s .. Is there any other way I could make it run even faster ? – Ness May 28 '22 at 20:56
  • Glad to hear it resulted in such an improvement! I added an update which could potentially further improve performance (albeit slightly) – danielcahall May 29 '22 at 04:14
  • Thank you again :) one further question: How do I replace `target_f = self.model_uav_pos([pos, reqs], training=False) target_f = np.array(target_f) target_f[0][action] = target` ? – Ness May 29 '22 at 10:47
  • The change would be how ’target_f’ is assigned - it would now be ‘target_f = np.array(self.pred_target_model_uav([pos, reqs]))’ since the prediction and max along an axis will now be computed in ‘pred_target_model_uav’ – danielcahall May 29 '22 at 11:35
  • So `target_f[0][action] = target` is not required ? – Ness May 29 '22 at 11:48
  • No, that line would still be required as well. The full change would be: `target_f = np.array(self.pred_target_model_uav([pos, reqs])) target_f[0][action] = target ` I was just highlighting the main difference in the current implementation and the proposed change for that line. – danielcahall May 29 '22 at 12:29
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/245145/discussion-between-ness-and-danielcahall). – Ness May 29 '22 at 12:34
  • To my understanding, self.pred_model_uav([pos, reqs]) returns the index of the max prediction, but in the original code aka target_f = self.model_uav_pos([pos, reqs]) it returns a list of predictions, so I'm a bit confused as to how target_f[0][action] = target would still work – Ness May 29 '22 at 13:23
  • self.pred_model_uav = tf.function(lambda x: tf.argmax(self.model_uav_pos(x)), jit_compile=True) – Ness May 29 '22 at 13:24
  • Originally: target_f = self.model_uav_pos([pos, reqs]); target_f = np.array(target_f); target_f[0][action] = target; – Ness May 29 '22 at 13:24
  • Ah, I think I was looking at the wrong code snippet! My apologies - you are right, in the case where `target_f` is being computed and assigned, the `pred_model_uav_pos` method would not be used - in that case, just use the `call` method as detailed previously (`target_f = self.model_uav_pos(...)`). `pred_model_uav_pos` would be used in `act_upon_choosing_a_new_position` as that's where the argmax is being computed, and `pred_target_model_uav` would be used in the `train_pos_nn` method's loop in the `done` check, as that's where the maximum along an axis is computed. – danielcahall May 30 '22 at 03:07
  • That makes sense. Just to verify though, I call `self.model_uav_pos()` without the need for `dnn.call = tf.function(dnn.call)` in `_build_pos_model()` function ? – Ness May 30 '22 at 09:36
  • Added another update detailing the approach. It looks like you would need to wrap the `call` method of `model_uav_pos`, so after it's returned in `_build_pos_model` it can be wrapped. In this scenario we have three functions compiled into TF graphs - I'm not sure if it will yield much further performance improvement but I think it's worth testing. – danielcahall May 30 '22 at 12:42

Utilizing a GPU (Graphics Processing Unit) will generally make model training faster. You can follow these steps to train your model on a GPU:

How to Finally Install TensorFlow 2 GPU on Windows 10 in 2022:

  • Step 1: Find out the TF version and its drivers.
  • Step 2: Install Microsoft Visual Studio
  • Step 3: Install the NVIDIA CUDA toolkit
  • Step 4: Install cuDNN
  • Step 5: Extract the ZIP folder and copy core directories
  • Step 6: Add CUDA toolkit to PATH
  • Step 7: Install TensorFlow inside a virtual environment with Jupyter Lab

(Detailed instruction in the link above)

Alternatively, you can use Google Colab, which has a GPU option that doesn't require any installation. You can change the accelerator in the Colab settings: Runtime -> Change runtime type -> None/GPU/TPU.
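Once set up (or in a Colab GPU runtime), a quick sanity check is to ask TensorFlow which GPUs it can see:

import tensorflow as tf

# an empty list means TensorFlow will silently fall back to the CPU
print(tf.config.list_physical_devices('GPU'))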
