Greetings, I am trying to perform Bayesian optimization using the bayesian_optimization library with a custom kernel function, concretely an RBF variant that uses the Kendall distance.
I am trying to pass the Kendall distance as an argument to the cdist and pdist functions from scipy.spatial.distance. I am reusing the code of scipy.stats.kendalltau; concretely, my Kendall distance is defined as follows:
import numpy as np
from scipy.stats._stats import _kendall_dis  # private helper used by scipy.stats.kendalltau

def kendall_distance(x, y):
    # sort on y and convert y to dense ranks
    perm = np.argsort(y)
    x, y = x[perm], y[perm]
    y = np.r_[True, y[1:] != y[:-1]].cumsum(dtype=np.intp)

    # stable sort on x and convert x to dense ranks
    perm = np.argsort(x, kind='mergesort')
    x, y = x[perm], y[perm]
    x = np.r_[True, x[1:] != x[:-1]].cumsum(dtype=np.intp)

    dis = _kendall_dis(x, y)  # discordant pairs
    return dis
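Since _kendall_dis only returns the discordant-pair count, a quick sanity check I use (a sketch, not part of the notebook) is to compare it against scipy.stats.kendalltau on tie-free permutations, where tau = 1 - 4*Q / (n*(n-1)) with Q the number of discordant pairs:

import numpy as np
from scipy.stats import kendalltau

x = np.random.permutation(8).astype(float)
y = np.random.permutation(8).astype(float)

n = len(x)
tau, _ = kendalltau(x, y)
# Without ties, Q can be recovered from tau directly.
q_from_tau = round((1 - tau) * n * (n - 1) / 4)
print(q_from_tau, kendall_distance(x, y))  # both should give the same count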
With this distance, I define my custom kernel class:
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn.gaussian_process.kernels import (Hyperparameter, Kernel,
                                              NormalizedKernelMixin,
                                              StationaryKernelMixin,
                                              _check_length_scale)

class PermutationRBF(StationaryKernelMixin, NormalizedKernelMixin, Kernel):
    def __init__(self, alpha=1.0, alpha_bounds=(1e-5, 1e5)):
        self.alpha = alpha
        self.alpha_bounds = alpha_bounds

    @property
    def anisotropic(self):
        return np.iterable(self.alpha) and len(self.alpha) > 1

    @property
    def hyperparameter_length_scale(self):
        if self.anisotropic:
            # name must match the attribute ("alpha") so sklearn can find it
            return Hyperparameter("alpha", "numeric",
                                  self.alpha_bounds,
                                  len(self.alpha))
        return Hyperparameter(
            "alpha", "numeric", self.alpha_bounds)

    def __call__(self, X, Y=None, eval_gradient=False):
        X = np.atleast_2d(X)
        alpha = _check_length_scale(X, self.alpha)
        if Y is None:
            dists = pdist(X / alpha, kendall_distance)  # First change
            K = np.exp(-.5 * dists)
            # convert from upper-triangular matrix to square matrix
            K = squareform(K)
            np.fill_diagonal(K, 1)
        else:
            if eval_gradient:
                raise ValueError(
                    "Gradient can only be evaluated when Y is None.")
            dists = cdist(X / alpha, Y / alpha, kendall_distance)  # Second change
            K = np.exp(-.5 * dists)

        if eval_gradient:
            if self.hyperparameter_length_scale.fixed:
                # Hyperparameter l kept fixed
                return K, np.empty((X.shape[0], X.shape[0], 0))
            elif not self.anisotropic or alpha.shape[0] == 1:
                K_gradient = \
                    (K * squareform(dists))[:, :, np.newaxis]
                return K, K_gradient
            elif self.anisotropic:
                # We need to recompute the pairwise dimension-wise distances
                K_gradient = (X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2 \
                    / (alpha ** 2)
                K_gradient *= K[..., np.newaxis]
                return K, K_gradient
        else:
            return K
The only changes from the original RBF kernel (tagged with comments) are passing the kendall_distance function to the pdist and cdist calls.
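For context, the kernel is meant to be dropped into the Gaussian process that the bayesian_optimization package builds internally; a minimal sketch of how I exercise it in isolation (assuming sklearn's GaussianProcessRegressor, with made-up training data) looks like this:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy data: each row is a permutation, the targets are arbitrary.
rng = np.random.default_rng(0)
X_train = np.array([rng.permutation(10) for _ in range(20)], dtype=float)
y_train = rng.random(20)

gpr = GaussianProcessRegressor(kernel=PermutationRBF(alpha=1.0), normalize_y=True)
gpr.fit(X_train, y_train)
print(gpr.predict(X_train[:3]))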
The problem appears when I run the optimization with this kernel: the performance is very slow compared with the RBF kernel. Computing the Kendall distance, O(n log n), is more expensive than the Euclidean distance, O(n); however, the difference should not be this noticeable for small sizes.
Custom kernel
iteration: 0
time: 0.0008704662322998047
iteration: 1
time: 1.2141697406768799
iteration: 2
time: 2.3469510078430176
iteration: 3
time: 3.5015127658843994
iteration: 4
time: 4.695566892623901
RBF kernel
iteration: 0
time: 0.000415802001953125
iteration: 1
time: 0.020179033279418945
iteration: 2
time: 0.033345937728881836
iteration: 3
time: 0.033483028411865234
iteration: 4
time: 0.0286252498626709
You can see the complete notebook and the results here.
I think the problem is that when I call the pdist and cdist functions with a built-in distance, like the Euclidean, they run the following code
elif isinstance(metric, str):
    mstr = metric.lower()
    mstr, kwargs = _select_weighted_metric(mstr, kwargs, out)

    metric_name = _METRIC_ALIAS.get(mstr, None)

    if metric_name is not None:
        X, typ, kwargs = _validate_pdist_input(X, m, n,
                                               metric_name, **kwargs)

        if 'w' in kwargs:
            metric_name = _C_WEIGHTED_METRICS.get(metric_name, metric_name)

        # get pdist wrapper
        pdist_fn = getattr(_distance_wrap,
                           "pdist_%s_%s_wrap" % (metric_name, typ))
        pdist_fn(X, dm, **kwargs)
        return dm
which I suppose is implemented and optimized in C. On the other hand, if I use my custom metric, the executed code fragment is
if callable(metric):
    mstr = getattr(metric, '__name__', 'UnknownCustomMetric')
    metric_name = _METRIC_ALIAS.get(mstr, None)

    if metric_name is not None:
        X, typ, kwargs = _validate_pdist_input(X, m, n,
                                               metric_name, **kwargs)

    k = 0
    for i in range(0, m - 1):
        for j in range(i + 1, m):
            dm[k] = metric(X[i], X[j], **kwargs)
            k = k + 1
which is a pure Python loop and therefore slower by default.
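To get a feel for how large that overhead is on its own, independent of my kernel, one can compare the string metric against an equivalent Python callable (a quick sketch; the sizes are arbitrary):

import numpy as np
from timeit import timeit
from scipy.spatial.distance import pdist

X = np.random.rand(200, 50)

# Built-in metric name: dispatched to the compiled C loop.
t_c = timeit(lambda: pdist(X, 'euclidean'), number=10)

# The same distance passed as a Python callable: runs the pure-Python double loop.
t_py = timeit(lambda: pdist(X, lambda a, b: np.sqrt(((a - b) ** 2).sum())), number=10)

print(t_c, t_py)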
Could this be the reason for the bottleneck? If it is, I need the execution to be more efficient; is there any way to improve this?
I leave the test code again here: Complete Notebook
Edit with new information:
As the comments suggested, I used line_profiler to find the bottleneck, and these are the obtained results, simplified:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...
    58         1        879.0    879.0     92.4      dm = calculate_pdist_dm(metric,dm,m,X)
...
where calculate_pdist_dm is as follows
def calculate_pdist_dm(metric, dm, m, X):
    k = 0
    for i in range(0, m - 1):
        for j in range(i + 1, m):
            dm[k] = metric(X[i], X[j])
            k = k + 1
    return dm
and the metric function is the Kendall distance shown above, with the following timing results:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           @do_profile()
     3                                           def kendall_distance(x,y):
     4         1         11.0     11.0      6.9      perm = np.argsort(y)  # sort on y and convert y to dense ranks
     5         1          1.0      1.0      0.6      x, y = x[perm], y[perm]
     6         1         59.0     59.0     37.1      y = np.r_[True, y[1:] != y[:-1]].cumsum(dtype=np.intp)
     7
     8                                               # stable sort on x and convert x to dense ranks
     9         1          9.0      9.0      5.7      perm = np.argsort(x, kind='mergesort')
    10         1          1.0      1.0      0.6      x, y = x[perm], y[perm]
    11         1         50.0     50.0     31.4      x = np.r_[True, x[1:] != x[:-1]].cumsum(dtype=np.intp)
    12
    13         1         28.0     28.0     17.6      dis = _kendall_dis(x, y)  # discordant pairs
    14         1          0.0      0.0      0.0      return dis

Timer unit: 1e-06 s
Total time: 0.000159 s
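Incidentally, the @do_profile() decorator in that output is not part of line_profiler itself; it is roughly the usual wrapper recipe (a sketch):

from line_profiler import LineProfiler

def do_profile():
    # Minimal line_profiler decorator: runs the wrapped function under a
    # LineProfiler and prints the per-line statistics afterwards.
    def decorator(func):
        def wrapper(*args, **kwargs):
            profiler = LineProfiler()
            profiled = profiler(func)
            try:
                return profiled(*args, **kwargs)
            finally:
                profiler.print_stats()
        return wrapper
    return decorator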
I should also say that if I use a pure Python implementation of the Kendall distance,
def kendall_distance(x, y):
    distance = 0
    for i in range(len(x)):
        for j in range(i, len(x)):
            a = x[i] - x[j]
            b = y[i] - y[j]
            if a * b < 0:
                distance += 1
    return distance
I can use numba to run it more efficiently, but I am still far from the efficiency of the built-in distances (results below for the later iterations 95-99, where the Bayesian optimization becomes computationally expensive). Is there any way to improve this?
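For reference, the numba version is essentially the same double loop decorated with @njit (a sketch; the function name is mine, and the exact code is in the linked notebook):

from numba import njit

@njit(cache=True)
def kendall_distance_numba(x, y):
    # Same O(n^2) discordant-pair count as above, compiled by numba.
    distance = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if (x[i] - x[j]) * (y[i] - y[j]) < 0:
                distance += 1
    return distance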
Custom kernel with Kendall distance and numba
iteration: 95
time: 0.3591618537902832
iteration: 96
time: 0.41269588470458984
iteration: 97
time: 0.40320730209350586
iteration: 98
time: 0.40665769577026367
iteration: 99
time: 0.37867259979248047
RBF kernel with the scipy built-in Euclidean distance
iteration: 95
time: 0.04483485221862793
iteration: 96
time: 0.046830177307128906
iteration: 97
time: 0.03493475914001465
iteration: 98
time: 0.03614163398742676
iteration: 99
time: 0.042229413986206055
The new code using numba is here