
I need to find the minimum of a cost function with several thousand variables. The cost function is simply a least-squares calculation and can be computed easily and quickly with NumPy vectorization. Despite this, the optimization still takes a painfully long time. My guess is that the slow runtime is occurring in SciPy's minimizer rather than in my cost function. How can I change the parameters of SciPy's minimizer to speed up the runtime?

Sample Code:

import numpy as np
from scipy.optimize import minimize

# random data
x = np.random.randn(100, 75)

# initial weights guess
startingWeights = np.ones(shape=(100, 75))

# random y vector 
y = np.random.randn(100)


def costFunction(weights):
   # reshapes flattened weights into 2d matrix
   weights = np.reshape(weights, newshape=(100, 75))

   # weighted row-wise sum
   weighted = np.sum(x * weights, axis=1)

   # squared residuals
   residualsSquared = (y - weighted) ** 2

   return np.sum(residualsSquared)


result = minimize(costFunction, startingWeights.flatten())
  • Do you have example code with your cost function and the minimizer you used? Some minimizers can be slow, e.g. Nelder-Mead, whereas others are a bit faster, e.g. COBYLA. And since many of these minimizers call FORTRAN implementations or similar, the solvers are basically as fast as it gets. Still, tweaking some parameters might be useful for increasing speed, e.g. the maximum number of iterations (a minimal sketch follows these comments). – cmbfast Jul 24 '21 at 05:36
  • In addition to cmbfast's comment: Did you provide the exact objective gradient and Hessian? If not, both are approximated by finite differences, which in turn leads to many evaluations of the objective for **each** evaluation of the gradient. Hence, providing the exact gradient and Hessian can significantly speed up the optimization. – joni Jul 24 '21 at 05:59
  • Look at the full output, which should tell you the number of calls to your objective. – hpaulj Jul 24 '21 at 06:06
  • Added sample code. I used the default minimize call without specifying a method. Note that my approach is inspired by this: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1971363 – John Sorensen Jul 24 '21 at 06:07
  • My guess is that minimizing 7500 values requires many function calls. I wouldn't be surprised if the time increased exponentially with number of values. – hpaulj Jul 24 '21 at 07:02
  • Scipy is not designed for large problems. – Erwin Kalvelagen Jul 24 '21 at 12:18
  • for large/multidimensional datasets, see [Statistical learning approaches for global optimization](https://yandex.by/search/?text=global+optimization+by+batches+&clid=2233627&lr=157) and execute the optimization in batches, in parallel... – JeeyCi Aug 05 '23 at 10:52
  • sometimes, "replacing the cost function with root-mean-square-error rather than sum-of-squared-error [helps](https://stackoverflow.com/questions/76122980/fitting-data-with-scipy-optimize-minimize-with-both-constraints-and-bounds/76124870#76124870)" – JeeyCi Aug 05 '23 at 15:38
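To illustrate the tuning cmbfast mentions, the solver and its iteration budget can be chosen via minimize's method and options arguments. A minimal sketch, reusing costFunction and startingWeights from the question; the method choice and iteration limit below are only illustrative assumptions:

from scipy.optimize import minimize

# L-BFGS-B keeps a limited-memory Hessian approximation, which scales better
# to thousands of variables than the dense approximation of the default BFGS;
# maxiter caps the number of iterations (both choices here are illustrative)
result = minimize(
    costFunction,
    startingWeights.flatten(),
    method='L-BFGS-B',
    options={'maxiter': 200},
)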

2 Answers


As already noted in the comments, it's highly recommended to provide the exact objective gradient for a large problem with N = 100*75 = 7500 variables. Without a provided gradient, it will be approximated by finite differences, by means of the approx_derivative function. However, finite differences are error-prone and computationally expensive, because each evaluation of the gradient requires 2*N evaluations of the objective function (without caching).

This can be easily illustrated by timing the objective and the approximated gradient:

In [7]: %timeit costFunction(startingWeights.flatten())
23.5 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [8]: from scipy.optimize._numdiff import approx_derivative

In [9]: %timeit approx_derivative(costFunction, startingWeights.flatten())
633 ms ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Consequently, each gradient evaluation takes more than half a second on my machine! A more efficient approach to evaluating the gradient is algorithmic differentiation. Using the JAX library, it's quite easy:

import jax.numpy as jnp
from jax import jit, value_and_grad

def costFunction(weights):
   # reshapes flattened weights into 2d matrix
   weights = jnp.reshape(weights, newshape=(100, 75))

   # weighted row-wise sum
   weighted = jnp.sum(x * weights, axis=1)

   # squared residuals
   residualsSquared = (y - weighted) ** 2

   return jnp.sum(residualsSquared)

# create the derivatives
obj_and_grad = jit(value_and_grad(costFunction))

Here, value_and_grad creates a function that evaluates the objective and the gradient and returns both, i.e. obj_value, grad_values = obj_and_grad(x0). So let's time this function:

In [12]: %timeit obj_and_grad(startingWeights.flatten())
132 µs ± 6.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Thus, we now evaluate the objective and the gradient nearly 5000 times faster than before. Finally, we can tell minimize that our objective function returns the objective and the gradient by setting jac=True. So

minimize(obj_and_grad, x0=startingWeights.flatten(), jac=True)

should significantly speed up the optimization.
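If you would rather not add a JAX dependency, the exact gradient of this particular least-squares objective can also be written out by hand and returned together with the objective value. A sketch, reusing x, y and startingWeights from the question; the function name is only illustrative:

import numpy as np
from scipy.optimize import minimize

def obj_and_grad_numpy(flat_weights):
    weights = flat_weights.reshape(100, 75)
    residuals = y - np.sum(x * weights, axis=1)        # shape (100,)
    obj = np.sum(residuals ** 2)
    # d/dW_ij of sum_k (y_k - sum_j x_kj * W_kj)^2 = -2 * residuals_i * x_ij
    grad = (-2.0 * residuals[:, None] * x).ravel()     # shape (7500,)
    return obj, grad

result = minimize(obj_and_grad_numpy, startingWeights.flatten(), jac=True)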

PS: You can also try the state-of-the-art Ipopt solver through the cyipopt package, which provides an interface similar to scipy.optimize.minimize.
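A minimal sketch of that route, assuming cyipopt is installed and reusing the JAX costFunction from above (the explicit float/np.asarray conversions are only a precaution):

import numpy as np
from cyipopt import minimize_ipopt
from jax import grad, jit

obj = jit(costFunction)              # JAX objective from above
obj_grad = jit(grad(costFunction))   # exact gradient via automatic differentiation

result = minimize_ipopt(
    lambda w: float(obj(w)),                 # objective as a plain Python float
    x0=startingWeights.flatten(),
    jac=lambda w: np.asarray(obj_grad(w)),   # gradient as a plain numpy array
)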

joni
  • thanks for [JAX - Automatic differentiation](https://www.linkedin.com/advice/0/what-some-common-methods-compute-jacobian) – JeeyCi Aug 04 '23 at 07:16

> The cost function is simply a least squares calculation

do not use minimize; use least_squares, if that is exactly what you need. Example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import least_squares

x= np.array([0.23, 0.66, 0.93, 1.25, 1.75, 2.03, 2.24, 2.57, 2.87, 2.98])
y= np.array([0.25, -0.27, -1.12, -0.45, 0.28, 0.13, -0.27, 0.26, 0.58, 1.03])    
plt.plot(x,y, 'bv')
plt.show()

def model(theta, t):
    return theta[0]*np.exp(t) + theta[1]*np.log(t) + theta[2]*np.sin(t) + theta[3]*np.cos(t)    # a*e^t + b*ln(t) + c*sin(t) + d*cos(t)

def residual(theta, t, y):       # f-a
     return (model(theta, t) - y).flatten()

theta = np.array([0.5, 0.5, 0.5, 0.5])

res = least_squares(residual, theta, args=(x, y), verbose=1)   # optionally pass jac= for an analytic Jacobian
print(res)

x_test = np.linspace(0.1, 3)   # start above 0 to avoid log(0) in the model
y_test = model(res.x, x_test)
plt.plot(x,y, 'bo', x_test, y_test, 'k-')
plt.show()
  • as you can see, there is NO place for your huge x0 = startingWeights of shape (100, 75) (and even for your minimization approach you would only need x0 of shape (75,) as the initial guess of the coefficients, if you really need one), because x0 is just the initial coefficient vector (theta) to optimize against your (100, 75) input data xs...

  • the optimization procedure is executed locally for each of your data points (the 100 ys you have), even if you add regularization logic

  • is your problem's formalization correct? Could you prefer a distance metric prior to optimization? Or choose global optimization algorithms to minimize over the whole space you have? Or manually create a Jacobian function for the minimization method (as a parameter, as joni answered)?

  • anyway, I think you'd better not reformulate your convex problem as LS, or consider doing so only for a linear-LS-to-convex reformulation, if in reality you need the global optimum over data of shape (100, 75)...

  • again, I doubt that your data really consists of that many (75) linearly independent dimensions; regularization can help... either change your objective for minimization or apply any of the tips I mentioned

If it is, see LS optimization; some options are described there (including the Jacobi method and example implementations, preconditioning, conjugate gradient for sparse matrices [which is what your multidimensional data becomes, using the 2-norm condition number, if your data is spatial], rotations for a better choice of step direction in the minimization, etc.)...

P.S. When the matrices grow large, matrix factorization is needed.
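To illustrate the sparse route mentioned above: the OP's residuals are linear in the flattened weights, so the whole problem can be written as an underdetermined sparse linear system A w ≈ y and handed to an iterative, conjugate-gradient-type solver such as scipy.sparse.linalg.lsqr. A sketch under that assumption, with random data mirroring the question:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 75))
y = rng.standard_normal(100)

# Row i of A holds x[i, :] in the columns that correspond to weights[i, :]
# of the flattened (100, 75) weight matrix, so A @ w reproduces the
# row-wise weighted sums from the question.
rows = np.repeat(np.arange(100), 75)
cols = np.arange(100 * 75)
A = csr_matrix((x.ravel(), (rows, cols)), shape=(100, 100 * 75))

# lsqr returns a least-squares solution of A w ≈ y; with 100 equations and
# 7500 unknowns the system is underdetermined, so the residual is ~0.
w = lsqr(A, y)[0]
print(np.sum((y - A @ w) ** 2))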

JeeyCi
  • indeed, there is a little difference in [minimize vs. least_squares](https://stackoverflow.com/questions/49211783/why-does-scipy-optimize-least-squares-exist-when-scipy-optimize-minimize-could-p) – JeeyCi Aug 04 '23 at 07:38
  • with scipy.optimize.least_squares you, probably, can even calculate [standard deviation errors](https://stackoverflow.com/a/42662768/15893581) – JeeyCi Aug 04 '23 at 07:47
  • [correct formalization of least_squares problem](https://docs.scipy.org/doc/scipy-1.6.3/reference/tutorial/optimize.html#least-squares-minimization-least-squares), vs. simple minimization problem, as desired by OP – JeeyCi Aug 04 '23 at 08:12
  • BTW, [overdetermined problem Ax = b](https://docplayer.net/storage/114/213176966/1692004485/EgUzJi1_C_dW3TuvbrVPcA/213176966.pdf) – JeeyCi Aug 14 '23 at 08:36