Understanding hyperopt's TPE algorithm

Question

I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:

(in the following, x=hyperparameters and y=loss)

1. Start by creating a search history of [x, y] pairs, say 10 points.
2. Sort the hyperparameters by their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized).
3. Fit a kernel density estimate to both the poor hyperparameter group (g(x)) and the good hyperparameter group (l(x)).
4. Good candidates will have low probability under g(x) and high probability under l(x), so we propose to evaluate the function at argmin(g(x)/l(x)).
5. Evaluate the (x, y) pair at the proposed point and repeat steps 2-5.
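
For reference, the ratio in step 4 is tied to expected improvement (EI) in the original paper. As a sketch, writing y* for the loss threshold that separates the two groups, γ = p(y < y*), l(x) = p(x | y < y*) and g(x) = p(x | y ≥ y*), the relation is:

```latex
\mathrm{EI}_{y^*}(x) \;\propto\; \left( \gamma + \frac{g(x)}{l(x)}\,(1 - \gamma) \right)^{-1}
```

Since the right-hand side decreases monotonically as g(x)/l(x) grows, minimizing g(x)/l(x) is the same as maximizing the expected improvement.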

I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.

import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde


def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2+noise

def split_meassures(x_obs,y_obs,gamma=1/2):
    #split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x':x_obs[:size],'y':y_obs[:size]}
    g = {'x':x_obs[size:],'y':y_obs[size:]}
    y_star = (l['y'][-1]+g['y'][0])/2
    return l,g,y_star

#sample objective function values for illustration
x_obj = np.linspace(-5,5,10000)
y_obj = objective_func(x_obj)

#start by sampling a parameter search history
x_obs = np.linspace(-5,5,10)
y_obs = measure(x_obs)

nr_iterations = 100
for i in range(nr_iterations):

    #sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs,y_obs = x_obs[sort_idx],y_obs[sort_idx]

    #split sorted observations in two groups (l and g)
    l,g,y_star = split_meassures(x_obs,y_obs)

    #approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)

    #define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l

    if i%10==0:
        plt.figure()
        plt.subplot(2,2,1)
        plt.plot(x_obj,y_obj,label='Objective')
        plt.plot(x_obs,y_obs,'*',label='Observations')
        plt.plot([-5,5],[y_star,y_star],'k')
        plt.subplot(2,2,2)
        plt.plot(x_obj,kde_l)
        plt.subplot(2,2,3)
        plt.plot(x_obj,kde_g)
        plt.subplot(2,2,4)
        plt.semilogy(x_obj,eval_measure)
        plt.draw()

    #find point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs,[best_search])
    y_obs = np.append(y_obs,[measure(np.asarray([best_search]))])

plt.show()

I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?

score 1 · Accepted Answer · answered Oct 13 '19 at 18:05

So, I am still learning about TPE as well, but here are the two problems in this code:

1. This code will only evaluate a few unique points. The next location is always the single best point according to the kernel density estimates, so the code has no way to explore the search space the way an acquisition function would.
2. Because this code simply appends new observations to x_obs and y_obs, it adds a lot of duplicates. The duplicates skew the set of observations, which leads to a very odd split that you can easily see in the later plots. The eval_measure starts out resembling the objective function but diverges later on.

If you remove the duplicates in x_obs and y_obs, you can eliminate problem no. 2. However, the first problem can only be fixed by adding some way of exploring the search space.
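
One way to get both effects at once is to draw random candidates from l(x) and keep the one with the largest l(x)/g(x), rather than taking the argmin over a fixed grid. The following is only a minimal sketch under my own assumptions, not hyperopt's actual implementation, which differs in its details: the names propose_next and run_tpe_sketch, the values gamma=0.25 and n_candidates=50, and the manual sampling from the KDE are all illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde


def measure(x):
    """Noise-free objective from the question: f(x) = x^2."""
    return np.asarray(x) ** 2


def propose_next(x_obs, y_obs, rng, gamma=0.25, n_candidates=50, bounds=(-5.0, 5.0)):
    """Draw candidates from l(x) and return the one with the largest l(x)/g(x).

    gamma, n_candidates and bounds are illustrative values, not hyperopt defaults.
    """
    x_sorted = x_obs[np.argsort(y_obs)]
    # Keep at least two points in each group so both KDEs are well defined.
    n_good = int(np.clip(np.ceil(gamma * len(x_sorted)), 2, len(x_sorted) - 2))
    good, poor = x_sorted[:n_good], x_sorted[n_good:]
    kde_l, kde_g = gaussian_kde(good), gaussian_kde(poor)

    # Sampling from a Gaussian KDE: pick a data point, then add kernel noise.
    bandwidth = np.sqrt(kde_l.covariance[0, 0])
    centers = rng.choice(good, size=n_candidates)
    candidates = centers + bandwidth * rng.standard_normal(n_candidates)
    candidates = np.clip(candidates, *bounds)

    # Rank the random candidates by l(x)/g(x); the floor avoids division by zero.
    ratio = kde_l(candidates) / np.maximum(kde_g(candidates), 1e-300)
    return candidates[np.argmax(ratio)]


def run_tpe_sketch(n_iterations=50, seed=0):
    """Run the simplified loop on f(x) = x^2, starting from the question's 10 points."""
    rng = np.random.default_rng(seed)
    x_obs = np.linspace(-5, 5, 10)
    y_obs = measure(x_obs)
    for _ in range(n_iterations):
        x_new = propose_next(x_obs, y_obs, rng)
        x_obs = np.append(x_obs, x_new)
        y_obs = np.append(y_obs, measure(x_new))
    return x_obs, y_obs


x_obs, y_obs = run_tpe_sketch()
print("unique points:", len(np.unique(x_obs)), "best x:", x_obs[np.argmin(y_obs)])
```

Because every proposal is a fresh random draw, the history no longer fills up with exact duplicates, and the randomness in the candidates gives the search some room to explore around the good region.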

Thanks for the comment. I figured out what was wrong, and it was indeed what you pointed out: instead of always picking the most certain point (which leads to duplicates), the new search point should be drawn from the approximated distribution. This page really helped me understand the TPE algorithm in the end (if you are still learning): https://dkopczyk.quantee.co.uk/hyperparameter-optimization/ - Søren Jensen, Nov 22 '19 at 10:19
