I have a DataFrame data with numerical values in columns X and Y. The procedure I'm trying to replicate in Python is as follows:
- Assume that values in Y can be approximated as a function of values in X in the form of Y_i = a ln(X_i) + b, where a and b are some coefficients. Find these coefficients a and b.
- Ignore the values in column Y that deviate from this curve by more than d units, i.e. drop any Y_i such that Y_i > a ln(X_i) + b + d or Y_i < a ln(X_i) + b - d.
- Approximate the remaining data with a curve of the same form again and find new coefficients a and b.
So far I have something like this:
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit
data = pd.read_excel("filename.xlsm")
def func(s, a, b):
    return a * np.log(s) + b
popt, pcov = curve_fit(func, data['X'], data['Y'])
a, b = popt
This answer says that iterating over rows should be avoided when possible, so what would be the right way to perform the procedure above without a row loop? Can anyone show me the right code?
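For reference, here is a minimal sketch of how the three steps could be done without iterating over rows, using a boolean mask on the residuals. The helper name fit_with_outlier_rejection, the threshold value d=1.0, and the synthetic demo data are assumptions made up for illustration; they are not part of my real data.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit


def func(s, a, b):
    return a * np.log(s) + b


def fit_with_outlier_rejection(data, d):
    """Hypothetical helper: fit, drop points farther than d from the curve, refit."""
    x = data["X"].to_numpy(dtype=float)
    y = data["Y"].to_numpy(dtype=float)

    # Step 1: initial fit on all rows.
    popt, _ = curve_fit(func, x, y)

    # Step 2: vectorized filter, keeping rows with |Y_i - (a ln(X_i) + b)| <= d.
    mask = np.abs(y - func(x, *popt)) <= d

    # Step 3: refit on the remaining rows only.
    popt_refit, _ = curve_fit(func, x[mask], y[mask])
    return popt_refit, data[mask]


# Synthetic demo data (assumed): Y = 2 ln(X) + 1 plus small noise,
# with every 25th point shifted upward to act as an outlier.
rng = np.random.default_rng(0)
x_demo = np.linspace(1.0, 50.0, 200)
y_demo = 2.0 * np.log(x_demo) + 1.0 + rng.normal(scale=0.05, size=x_demo.size)
y_demo[::25] += 5.0
demo = pd.DataFrame({"X": x_demo, "Y": y_demo})

(a, b), kept = fit_with_outlier_rejection(demo, d=1.0)
print(a, b, len(kept))
```

The mask is computed for the whole column at once, so no explicit loop over rows is needed. Note that np.log requires all values in X to be positive.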