1

I am trying to find a relationship between two variables (pv_ratio, battery_ratio) and a third variable 'value'. Both ratios range from 0 to 5 with points every 0.0625 (81x81=6561 points) and 'value' falls within [0,1].

The csv can be found here and looks like that:

    battery_ratio   pv_ratio    value
0   0.0000  0   1
1   0.0625  0   1
2   0.1250  0   1
3   0.1875  0   1
4   0.2500  0   1
5   0.3125  0   1
6   0.3750  0   1
7   0.4375  0   1
8   0.5000  0   1
9   0.5625  0   1

These plots give an idea of the relationships between my variables:

Pairplot

Here is my code to fit my curve, using sicpy.optimize.curve_fit and looking for exponential relationships. This code snippet reads the csv into a pandas df, finds optimal parameters for the f function, plots the results and gives a score to the fit.

I've been working in an iterative manner, trying many formulas for f, improving the score little by little.

from scipy.optimize import curve_fit
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (14.0, 8.0)

def f(X, a, b, c, d, e):
# the function I came up with after some trials, and which I'm trying to improve
    bt_r = X[:,0]  #battery_ratio
    pv_r = X[:,1] #pv_ratio
    return  (1 - a * np.exp(- e * pv_r ** b)) * np.exp(- (d ** bt_r) * c * pv_r)

def fit():
#find optimal parameters and score fit
    X = df[variables].values
    y = df.value.values
    popt, pcov = curve_fit(f, X, y)
    y_real, y_fit = df['value'], f(df[variables].values, *popt)
    score = np.round(np.sum(((y_fit - y_real)**2)),1)
    return popt, score
        
def check_fit(values):
    #Plot (y_fit, y) for all subsets
    def plot_subset(ax, variable, r_value):
        """Scatter plot (y_fit and y) against 'variable' with the other variable set at ratio
        - variable : string ['pv_ratio', 'battery_ratio']
        - r_value : float 
        """
        # ratio stands for the second variable which is fixed
        ratio = list(set(variables) - set([variable]))[0]
        df_ = df.query("{} == {}".format(ratio, r_value))

        # plot y and y fit
        y_real, y_fit = df_['value'], f(df_[variables].values, *popt)
        for y, c in zip([y_real, y_fit], ['b', 'r']):        
            ax.scatter(df_[variable], y, color=c, s=10, alpha=0.95)
        ax.set_title('{} = {}'.format(ratio, r_value))

    fig, ax = plt.subplots(nrows=2, ncols=len(values), sharex=True, sharey=True)
    for icol, r_value in enumerate(values):
        plot_subset(ax[0, icol], 'pv_ratio', r_value)
        plot_subset(ax[1, icol], 'battery_ratio', r_value)
        
    fig.tight_layout()
    print 'Score: {}'.format(score)
    

df = pd.read_csv('data.csv', index_col=0)
variables = ['battery_ratio', 'pv_ratio']
popt, score = fit()
check_fit([0,3,5]) #plot y_real and y_fit for these ratios

The code above yields the following picture (blue: real, red: fit) and gives a score to the fit.

Score

The best score (=sum((y_real - y_fit)²/len(y))) I can get is 9.3e-4, which is still not very good in practice, especially on the ramp-up phase.

I'm now stuck at a point where the try-and-repeat process shows its limits. How should I work to design faster and more efficiently my fitting function? Can I get a better score than 6.1?

mkrieger1
  • 19,194
  • 5
  • 54
  • 65
jodoox
  • 821
  • 2
  • 7
  • 22
  • Generally, one tries to fit to an equation that you believe has a physical/chemical/engineering basis. Shotgunning fit equations may result in a better fit, but absolutely no insight in to how the variables trade off with each other (and thus no predictive powers outside of the given data set). So, go back to the problem and figure out how the value should depend on the two variables. – Jon Custer Mar 01 '16 at 14:57
  • Sure. In this particular case though, I'm just trying to replicate values from a model built by a third-party ([link here](http://pvspeicher.htw-berlin.de/unabhaengigkeitsrechner/)) and I don't have access to their assumptions. I'm only trying to replicate the results on this restrained range, which would allow me to replace the current dirty lookup in my 6561-row table + 2D-interpolation by a formula. Doesn't it make sense as these curves look fairly simple? – jodoox Mar 01 '16 at 15:22
  • Than I would go straight to something like RectBivariateSpline and just directly use the spline interpolation functions. – Jon Custer Mar 01 '16 at 15:46
  • Right, this works fine on Python but I need to integrate that feature into an excel spreadsheet. The end-users don't have Python on their laptop and don't always have access to internet. I thought it would possible to come up with a formula interpolating the function on this somewhat narrow range, which still seems more elegant to me than the lookup thing in Excel. – jodoox Mar 01 '16 at 16:12
  • Ahhhh.... I see. OK, carry on. Now, I would note that you should be normalizing your best score by the number of data points to get an idea of how close it is. But even then 0.001 may not be good enough? – Jon Custer Mar 01 '16 at 16:17
  • Thanks for your help! I normalized the score in my formula (divided by 6561), the score is indeed quite low (~9.3e-4) but this seems mostly due to the higher ratios which converge quickly and are easier to fit. Indeed, the score goes up to ~1.2e-3 when I filter out ratios greater than 3 (see the scatter plot where battery_ratio=0). I'm often ~5-10% off on the ramp-up phase which matters the most. I wouldn't mind making my formula more complex, isn't there a brutal way to add 5 (or whatever) parameters so that the greater residual stays below 1-2%? So far, I only got worse results doing so... – jodoox Mar 01 '16 at 17:01
  • 1
    You could multiply by some n-dimensional polynomial and see what happens as n varies... – Jon Custer Mar 01 '16 at 18:43
  • Multiplying my best function by a 2d order polynomial didn't really help, my score goes from 9.37e-4 to 9.25e-4 and the maximum residual stays at 24%. I'll go for the full polynomial solution for the moment (see my answer) which is somehow simpler. Thanks for your help. – jodoox Mar 02 '16 at 09:46

2 Answers2

0

As suggested by @jon-custer, I tried a n-polynomial fit. My code is a slightly modified version of this SO answer.

import itertools
import numpy as np
import matplotlib.pyplot as plt

def polyfit2d(data, order=3):
    x = data.pv_ratio
    y = data.battery_ratio
    z = data.value
    
    ncols = (order + 1)**2
    G = np.zeros((x.size, ncols))
    
    ij = itertools.product(range(order+1), range(order+1))
    for k, (i,j) in enumerate(ij):
        G[:,k] = x**i * y**j
    m, _, _, _ = np.linalg.lstsq(G, z)
    
    y['fit'] = polyval2d(x, y, m)
    return m, y_fit

def polyval2d(x, y, m):
    order = int(np.sqrt(len(m))) - 1
    ij = itertools.product(range(order+1), range(order+1))
    z = np.zeros_like(x)
    for a, (i,j) in zip(m, ij):
        z += a * x**i * y**j  
    return z

m, y_fit = polyfit2d(df, 7)

Polynomials

The chart above shows the maximum residual, and the normalized score. The best results I get are for a degree 7 polynomial. My score goes down to ~6.4e-5 with residuals never greater than 5.5%, that's an accuracy I'm fine with.

mkrieger1
  • 19,194
  • 5
  • 54
  • 65
jodoox
  • 821
  • 2
  • 7
  • 22
0

This is not exactly Python-related, you want to fit your data into a surface.

Revert the data. Do a 1/x in values and make a trend line, line by line. You did it.

mkrieger1
  • 19,194
  • 5
  • 54
  • 65