
I have a numpy array containing x variables and a y variable that I'd like to use to calculate coefficients, where each coefficient is between 0 and 1 and the sum of all the coefficients equals 1. How would I go about doing this in Python? I'm currently using Gekko and am only getting weights that are all 0, or a single feature with a weight of 1, which doesn't make sense given my knowledge of the data. My actual data has over 100 features and 5k-plus rows.

import numpy as np
from gekko import GEKKO

x = np.array([[15., 21., 13.5, 12., 18., 15.5],
              [14.5, 20.5, 16., 14., 19.5, 20.5]])
y = np.array([55.44456011, 55.70023835])

# Number of variables and data points
n_vars = x.shape[1]
n_data = y.shape[0]

# Create a Gekko model
m = GEKKO()

# Set up variables
weights = [m.Var(lb=0, ub=1) for _ in range(n_vars)]

# Set up objective function
y_pred = [m.Intermediate(m.sum([weights[i] * x[j, i] for i in range(n_vars)])) for j in range(n_data)]
objective = m.sum([(y_pred[i] - y[i]) ** 2 for i in range(n_data)])
m.Minimize(objective)

# Constraint: sum of weights = 1
m.Equation(sum(weights) == 1)

# Solver options
m.options.SOLVER = 3  # IPOPT solver
m.options.IMODE = 3   # steady-state optimization mode

# Solve the optimization problem
m.solve(disp=False)

# Get the optimized weights
optimized_weights = [w.value[0] for w in weights]

finman69
  • Can't reproduce. When I run this code, I get optimized_weights = [0.0, 1.0, 0.0, 0.0, 2.9158327869e-08, 0.0], which is not all zero. – Nick ODell Jul 08 '23 at 22:19
  • I've added to my original question that my data is much larger than this sample data. My issue is that I have over 5k rows of data with more columns, but this method seems to only give one variable a weight of 1 or no variables a weight at all. – finman69 Jul 08 '23 at 23:32
    In this specific example, all of the Y variables are much larger than all of the X variables. No matter what weights it assigns, y_pred will always be smaller than y. Since it cannot get something close to Y, it puts all of the weight into the X variable which is on average the highest. The solution it's finding obeys your constraints and is optimal as far as I can see. If I replace y with `y = np.array([17, 17])`, then it assigns non-zero weights to every element of optimized_weights. – Nick ODell Jul 09 '23 at 02:04
  • I see, so would you recommend normalizing the data in some way? – finman69 Jul 10 '23 at 14:03
  • If you scale the Y variable, that's essentially the same thing as multiplying the coefficients by that scale, so your "coefficients add up to 1" rule would no longer be followed. You could also just remove that rule - I'm not clear from your question why that's important. – Nick ODell Jul 10 '23 at 15:29
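To illustrate the point made in the comments: when y is actually reachable as a convex combination of the x columns, a sum-to-one constrained fit does recover non-degenerate weights. Below is a minimal sketch using `scipy.optimize.minimize` with the SLSQP method instead of Gekko (my substitution, not part of the original question), on synthetic data generated from known weights:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y is built from known weights that already
# satisfy the constraints (each in [0, 1], summing to 1).
rng = np.random.default_rng(0)
X = rng.uniform(10, 25, size=(50, 6))
true_w = np.array([0.1, 0.3, 0.2, 0.15, 0.15, 0.1])
y = X @ true_w

n = X.shape[1]

def sse(w):
    """Sum of squared residuals for weights w."""
    r = X @ w - y
    return r @ r

res = minimize(
    sse,
    x0=np.full(n, 1.0 / n),          # start from uniform weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n,          # each weight in [0, 1]
    constraints=[{"type": "eq",       # weights sum to 1
                  "fun": lambda w: w.sum() - 1.0}],
)
w = res.x
print(w)  # close to true_w; weights sum to 1
```

Because the true weights here satisfy the constraints, the solver recovers them. If y is systematically larger (or smaller) than every x column, as in the question's sample data, the constrained optimum is forced to a corner of the simplex, which matches the all-or-nothing weights the OP observed.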

0 Answers