How can I find a well-fitting trendline in Matplotlib for this data?

Question

My problem is allegedly simple: I have scatter data in X and Y, and want to get a nice, well-fitting trendline with a known equation so that I can go on to convert LDR voltages into power readings. However, I'm having trouble generating a trendline in Matplotlib or SciPy that fits well, which I believe is because there's a logarithmic relationship.

I'm using Spyder and Matplotlib, and first tried plotting the X (Thorlabs) and Y (LDR) data as a log-log scatter plot. Because the data didn't seem to show a linear relationship after doing this, I then used NumPy's Polynomial.fit with degree 5 to 6. This looked good, but when I inverted the axes to get something of the form [LDR] = f[Thorlabs], I noticed the fit was suddenly not very good at all at the extremes of my data.

Judging by this question, curve_fit seems to be the way to go. I tried using curve_fit as described here and, after increasing the maximum number of curve-fit iterations, stumbled on the error message "TypeError: can't multiply sequence by non-int of type 'numpy.float64'", which I assume is because my data contains decimal points. I'm not sure how to account for this.

I have several mini-questions, then:

Am I misunderstanding the above examples?
Is there a better way I could go about trying to find the ideal trendline for this data? Is it possible that it's some sort of logarithmic relationship on top of a log-log plot?
Once I get a trendline, how can I make sure it fits well and can be displayed?


#import libraries
import matplotlib.pyplot as plt
import csv
import numpy as np
from numpy.polynomial import Polynomial
import scipy.optimize as opt

#initialise arrays - I create log arrays too so I can plot directly
deg = 6    #degree of polynomial fitting for Polynomial.fit()
thorlabs = []
logthorlabs = []

ldr = []
logldr = []

#read in LDR/Thorlabs datasets from file
with open('16ldr561nm.txt','r') as csvfile:
    plots = csv.reader(csvfile, delimiter='\t')
    for row in plots:
        thorlabs.append(float(row[0]))
        ldr.append(float(row[1]))
        logthorlabs.append(np.log(float(row[0])))
        logldr.append(np.log(float(row[1])))
#This seems to work just fine, I now have arrays containing data in float

#fit and plot log polynomials
p = Polynomial.fit(logthorlabs, logldr, deg)
plt.plot(*p.linspace()) #plot lines


#plot scatter graphs on log-log axis - either using log arrays or on loglog plot
#plt.loglog()
plt.scatter(logthorlabs, logldr, label='16bit ADC LDR1')
plt.xlabel('log Thorlabs laser power (microW)')
plt.ylabel('log LDR voltage (mV)')
plt.title('LDR voltage against laser power at 561nm')
plt.legend()
plt.show()


#attempt at using curve_fit - when using, comment out the above block
"""
# This is the function we are trying to fit to the data.
def func(x, a, b, c):
     return a * np.exp(-b * x) + c
#freaks out here as I get a type error which I am not sure how to account for


# Plot the actual data
plt.plot(thorlabs, ldr, ".", label="Data");

#Adjusted maxfev to 5000. I know you can make "guesses" here but I am not sure how to do so
# The actual curve fitting happens here
optimizedParameters, pcov = opt.curve_fit(func, thorlabs, ldr, maxfev=5000);

# Use the optimized parameters to plot the best fit
plt.plot(thorlabs, func(ldr, *optimizedParameters), label="fit");

# Show the graph
plt.legend();
plt.show();
"""

When using curve_fit, I get a "TypeError: can't multiply sequence by non-int of type 'numpy.float64'".

As I don't have enough reputation to post images, my raw dataset can be found here. (Otherwise I'd include the graphs!)

(Note that I actually have two datasets, but as I only want to know the principle for calculating a trendline for one, I've left out the other dataset above.)

In physics it mostly makes sense to first try to come up with a model and then verify if the model explains the experiment. What this means is: It doesn't make sense to fit some arbitrary function to the data if there is no reason to believe that this function describes the relationship between the two quantities. The notion "ideal trendline" is hence also a bit questionable. — ImportanceOfBeingErnest, Aug 06 '19 at 12:03
Concerning the programming part of this: The type error comes from trying to multiply a list by a float (`[1,2,3] * 4.3`). It is best to use NumPy arrays throughout your code. — ImportanceOfBeingErnest, Aug 06 '19 at 12:06
Thank you, that would explain the type error! My approach here was a bit of a naive attempt to just convert voltage into a representation of laser power via an empirical relationship that I know is reproducible - definitely not the tightest way to do things, and upon further advice from a friendly neighbourhood postdoc advisor I'll likely try a different tactic next. Certainly seems to be a good way to figure out how plotting in Python works, though (it's my first time!) — Rooksoup, Aug 06 '19 at 13:42
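
The type error discussed in the comments above can be reproduced in isolation. This is a minimal sketch with made-up numbers (the names `values`, `scale`, `list_error`, and `scaled` are illustrative, not from the original code). It uses a plain Python float, as in the comment's example; the message in the question names `numpy.float64` instead because the fitted parameters are NumPy scalars.

```python
import numpy as np

values = [1.0, 2.0, 3.0]  # a plain Python list, as built with .append()
scale = 4.3               # a non-integer multiplier

# A list only supports repetition by an integer, so this raises TypeError.
try:
    values * scale
    list_error = None
except TypeError as exc:
    list_error = str(exc)

print(list_error)

# Converting to a NumPy array makes the multiplication element-wise.
scaled = np.asarray(values) * scale
print(scaled)
```

Decimal points in the data are therefore not the cause; the container type is.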

AKX · Answer 1 · 2019-08-06T12:39:31.953

Refactoring your code a bit, most importantly to use native NumPy arrays once the data has been parsed from the file, stops it from crashing, but the CurveFit line doesn't look good at all.

The code prints the parameters fitted by curve_fit, which don't look very good either, along with a warning: "Covariance of the parameters could not be estimated". I'm no mathematician/statistician, so I don't know what to do there.

from numpy.polynomial import Polynomial
import csv
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt


def read_dataset(filename):
    x = []
    y = []

    with open(filename, "r") as csvfile:
        plots = csv.reader(csvfile, delimiter="\t")
        for row in plots:
            x.append(float(row[0]))
            y.append(float(row[1]))

    # cast to native numpy arrays
    x = np.array(x)
    y = np.array(y)
    return (x, y)


# Column 0 is the Thorlabs power and column 1 the LDR voltage, as in the question's code
thorlabs, ldr = read_dataset("16ldr561nm.txt")

plt.scatter(thorlabs, ldr, label="Data")
plt.xlabel("Thorlabs laser power (microW)")
plt.ylabel("LDR voltage (mV)")
plt.title("LDR voltage against laser power at 561nm")

# Generate and plot polynomial
p = Polynomial.fit(thorlabs, ldr, 6)
plt.plot(*p.linspace(), label="Polynomial")


# Generate and plot curvefit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c

optimizedParameters, pcov = opt.curve_fit(func, thorlabs, ldr)
print(optimizedParameters, pcov)
# Evaluate the fitted model at the x values (thorlabs), not at the y data
plt.plot(thorlabs, func(thorlabs, *optimizedParameters), label="CurveFit")

# Show everything
plt.legend()
plt.show()
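
One thing that may help with the covariance warning, and with the question's remark about "guesses", is the `p0` argument of `curve_fit`, which sets the starting values for the parameters. The sketch below is only an illustration on synthetic data, not the dataset from the question. The names `model`, `true_params`, and the particular starting guess are assumptions made for the example. It also computes an R² value from the residuals as one rough way to judge the fit.

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, a, b, c):
    # Same functional form as func() above
    return a * np.exp(-b * x) + c


# Synthetic stand-in data (NOT the asker's measurements)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
true_params = (2.5, 1.3, 0.5)
y = model(x, *true_params) + rng.normal(scale=0.02, size=x.size)

# Rough starting guess: amplitude from the data range, offset from the minimum
p0 = (y.max() - y.min(), 1.0, y.min())
params, pcov = curve_fit(model, x, y, p0=p0)

# One-sigma parameter uncertainties from the covariance matrix
perr = np.sqrt(np.diag(pcov))

# Coefficient of determination as a simple goodness-of-fit measure
residuals = y - model(x, *params)
r_squared = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

print(params, perr, r_squared)
```

Whether an exponential of this form is appropriate for the real data is a separate question; a good starting guess only helps the optimiser, it does not make the model right.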

If you really need to log() the data, it's easily done with

x = np.log(x)
y = np.log(y)

which will keep the arrays as NumPy arrays and be plenty faster than doing it "by hand".
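
If the logged data happen to fall on a roughly straight line over the range of interest, a straight-line fit in log-log space gives a power law with a known equation that is easy to invert. The question reports that the real data did not look linear on a log-log plot, so treat this purely as a sketch of the simplest case. It uses synthetic data, and the names `k_true`, `gamma_true`, and `voltage_to_power` are hypothetical.

```python
import numpy as np

# Synthetic stand-in: voltage = k * power**gamma with a little multiplicative noise
rng = np.random.default_rng(1)
power = np.logspace(0, 3, 40)
k_true, gamma_true = 12.0, 0.6
noise = np.exp(rng.normal(scale=0.01, size=power.size))
voltage = k_true * power**gamma_true * noise

# Straight-line fit in log space: log(V) = gamma * log(P) + log(k)
gamma, log_k = np.polyfit(np.log(power), np.log(voltage), 1)
k = np.exp(log_k)


def voltage_to_power(v):
    """Invert the fitted power law to turn a voltage into a power reading."""
    return (np.asarray(v, dtype=float) / k) ** (1.0 / gamma)


print(gamma, k)
print(voltage_to_power(k_true * 100.0**gamma_true))
```

Because the fitted equation is inverted analytically, there is no need to refit with the axes swapped.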
