
I am trying to extrapolate future data points from a data set that contains one continuous value per day for almost 600 days. I am currently fitting a 1st order function to the data using numpy.polyfit and numpy.poly1d. In the graph below you can see the curve (blue) and the 1st order function (green). The x-axis is days since beginning. I am looking for an effective way to model this curve in Python in order to extrapolate future data points as accurately as possible. A linear regression isn't accurate enough, and I'm unaware of any nonlinear regression methods that would work in this instance.

This solution isn't accurate enough as if I feed

[image: the data curve (blue) and the 1st order function (green)]

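import numpy
import matplotlib.pyplot as plt

# dfnew and future_days are defined elsewhere in my script
x = dfnew["days_since"]
y = dfnew["nonbrand"]

# Fit a 1st order polynomial (a straight line) and wrap it as a callable
z = numpy.polyfit(x, y, 1)
f = numpy.poly1d(z)

# Evaluate the fitted line on the future days
x_new = future_days
y_new = f(x_new)

plt.plot(x, y, '.', x_new, y_new, '-')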

EDIT:

I have now tried `curve_fit` with a logarithmic function, since the curve and the behaviour of the data seem to conform to one:

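from scipy.optimize import curve_fit

def func(x, a, b):
    # Logarithmic model: a * ln(x) + b
    return a*numpy.log(x)+b

x = dfnew["days_since"]
y = dfnew["nonbrand"]

# Least-squares estimate of a and b
popt, pcov = curve_fit(func, x, y)

plt.plot(future_days, func(future_days, *popt), '-')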

However, when I plot it, my Y-values are way off:

[image: plot of the fitted logarithmic curve]

BLL27
  • A very easy way is: First look at the graph and think of a parametric family of functions that graph could belong to. Maybe some logarithmic function? Then use `curve_fit` from scipy to find the concrete parameters and use that function for extrapolation. – cel Aug 27 '15 at 05:12
  • Thanks, I have tried that and would appreciate your feedback on my edit. – BLL27 Aug 27 '15 at 06:04
  • It's a little bit cumbersome to help you since I cannot try things out myself. `a*numpy.log(x)+b` seems very problematic. What happens if you allow an x-axis shift as well? `a*numpy.log(x + b) + c`? – cel Aug 27 '15 at 06:19
  • Excellent! That's the nut I was trying to crack, thank you. That curve is a pretty good fit to the expected behaviour of my data and will probably provide a good solution to the task. – BLL27 Aug 27 '15 at 06:26
  • Glad it worked. Trying radical functions may also give you good results. For example, comparing the logarithm to a square root could make sense. – cel Aug 27 '15 at 06:35
  • What if I wanted to slow the rate of decay of the curve? How could I write this in function form? – BLL27 Aug 28 '15 at 00:34
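
A minimal sketch of the shifted logarithm suggested in the comments above, `a*numpy.log(x + b) + c`, fitted with `curve_fit`. The data here is synthetic because the real `dfnew` is not available, and the starting guess `p0` and the lower bound on `b` are assumptions added for illustration. The bound keeps `x + b` positive so the logarithm stays defined at day 0.

```python
import numpy as np
from scipy.optimize import curve_fit


def shifted_log(x, a, b, c):
    """Logarithm with an x-axis shift: a * log(x + b) + c."""
    return a * np.log(x + b) + c


# Synthetic stand-in for the real series (an assumption, not the asker's data).
rng = np.random.default_rng(0)
days = np.arange(600, dtype=float)
values = shifted_log(days, 120.0, 15.0, 40.0) + rng.normal(0.0, 5.0, days.size)

# Rough starting guess (assumed); b is bounded below so log(x + b) is defined at x = 0.
popt, pcov = curve_fit(
    shifted_log,
    days,
    values,
    p0=(100.0, 10.0, 0.0),
    bounds=([-np.inf, 1e-6, -np.inf], [np.inf, np.inf, np.inf]),
)

# Extrapolate 100 days past the end of the data.
future_days = np.arange(600, 700, dtype=float)
forecast = shifted_log(future_days, *popt)
print("fitted a, b, c:", popt)
```

With real data, `days` and `values` would be replaced by the `days_since` and `nonbrand` columns, and `p0` would probably need adjusting to the scale of the series.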

1 Answer


A very general rule of thumb: if your fitting function does not fit your actual data well enough, then either:

  • You are using the function wrong, e.g. you are using a 1st order polynomial. If you are convinced that the data is polynomial, try higher-order polynomials.
  • You are using the wrong function. It is always worth taking a look at:

    1. your data curve, and
    2. what you know about the process that is generating the data

    to come up with some speculation, theories, or guesses about what sort of model might fit better.

Might your process be a logarithmic one, a saturating one, etc.? Try them!
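
One way to "try them" is to fit each candidate on the early part of the series and score it on the most recent days, since extrapolation is the goal. The sketch below is an illustration under assumptions: the data is synthetic, the 500/100 split is arbitrary, and `log_model` and `rmse` are hypothetical helper names. Any other family (a saturating curve, a square root) could be added to `scores` in the same way.

```python
import numpy as np
from scipy.optimize import curve_fit


def log_model(x, a, b, c):
    return a * np.log(x + b) + c


# Synthetic data standing in for the real series (an assumption).
rng = np.random.default_rng(1)
x = np.arange(600, dtype=float)
y = 120.0 * np.log(x + 15.0) + 40.0 + rng.normal(0.0, 5.0, x.size)

# Fit on the first 500 days, then score each model on the last 100.
x_fit, y_fit = x[:500], y[:500]
x_held, y_held = x[500:], y[500:]


def rmse(predicted):
    """Root-mean-square error against the held-out days."""
    return float(np.sqrt(np.mean((predicted - y_held) ** 2)))


scores = {}
for degree in (1, 2, 3):
    poly = np.poly1d(np.polyfit(x_fit, y_fit, degree))
    scores[f"polynomial, degree {degree}"] = rmse(poly(x_held))

popt, _ = curve_fit(
    log_model,
    x_fit,
    y_fit,
    p0=(100.0, 10.0, 0.0),
    bounds=([-np.inf, 1e-6, -np.inf], [np.inf, np.inf, np.inf]),
)
scores["shifted logarithm"] = rmse(log_model(x_held, *popt))

# Lowest held-out error first.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: held-out RMSE {score:.1f}")
```

A model that fits the first 500 days well can still extrapolate badly, which is why the score is taken only on days the fit never saw.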

Finally, if you are not getting a consistent long term trend then you might be able to justify using cubic splines.
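
A rough sketch of the spline option, using `scipy.interpolate.UnivariateSpline` with `k=3` as one possible cubic spline (the answer does not name a specific routine, so this choice is an assumption). The data is synthetic, and the smoothing factor `s` is set to the number of points times an assumed noise variance, which is only a starting point.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic data standing in for the real series (an assumption).
rng = np.random.default_rng(2)
x = np.arange(600, dtype=float)
noise_sd = 5.0  # assumed noise level, used only to choose the smoothing factor
y = 120.0 * np.log(x + 15.0) + 40.0 + rng.normal(0.0, noise_sd, x.size)

# k=3 requests cubic pieces; s controls how closely the spline follows the noise.
spline = UnivariateSpline(x, y, k=3, s=x.size * noise_sd ** 2)

# With the default ext=0, calling the spline outside the data range extrapolates.
future_days = np.arange(600, 630, dtype=float)
forecast = spline(future_days)
print(forecast[:5])
```

Beyond the last data point the spline simply continues its final cubic piece, so the forecast horizon should stay short.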

Steve Barnes
  • Thanks, I think a logarithmic function is along the right lines of what I'm looking for, as my Y-value rate of increase diminishes with time, which is the expected behaviour for the data. I have tried implementing `curve_fit` but am getting an odd graph output as shown in my edit above. I would appreciate your feedback on it. – BLL27 Aug 27 '15 at 06:07
  • You need to plot the log of your data as well as the predictor values, then you can compare & just change your tick labels to the anti-log of the numbers that they would be. – Steve Barnes Aug 27 '15 at 06:13
  • I'm a little unclear on what you mean by this. Could you provide an example? – BLL27 Aug 27 '15 at 06:17
  • http://stackoverflow.com/questions/6431248/matplotlib-logarithmic-scale-but-require-non-logarithmic-labels gives you the basics. – Steve Barnes Aug 27 '15 at 06:32
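
One reading of the comments above, sketched with matplotlib: put the x-axis on a logarithmic scale and swap the tick formatter so the labels show plain numbers rather than powers of ten. The data is synthetic, the days start at 1 because a log axis cannot show 0, and the `Agg` backend line is an assumption for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import ScalarFormatter

# Synthetic data (an assumption); days start at 1 so every point is plottable.
days = np.arange(1, 601, dtype=float)
values = 120.0 * np.log(days + 15.0) + 40.0

fig, ax = plt.subplots()
ax.plot(days, values, ".")

# Logarithmic spacing on the x-axis, but ordinary (non-logarithmic) tick labels.
ax.set_xscale("log")
ax.xaxis.set_major_formatter(ScalarFormatter())
ax.set_xlabel("days since beginning")

fig.canvas.draw()  # render once so ticks and labels are computed
```

If the data really is logarithmic in `x`, it should look close to a straight line on this axis, which makes it easier to compare against the fitted values.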