
I have a dataframe data with numerical values in columns X and Y. The procedure that I'm trying to replicate in Python is as follows:

  • Assume that values in Y can be approximated as a function of values in X in the form of Y_i = a ln(X_i) + b, where a and b are some coefficients. Find these coefficients a and b.
  • Throw out / ignore values in column Y which deviate from this curve by more than d units, i.e. if some Y_i is such that Y_i > a ln(X_i) + b + d or Y_i < a ln(X_i) + b - d, then we ignore it.
  • Approximate the remaining data with a curve of the same form again and find new coefficients a and b.

So far I have something like this:

import pandas as pd
import numpy as np
from scipy.optimize import curve_fit

data = pd.read_excel("filename.xlsm")

def func(s, a, b):
    return a * np.log(s) + b

popt, pcov = curve_fit(func, data['X'], data['Y'])
a, b = popt

This answer says that iterating over rows should be avoided when possible, so what would be the right way to perform the given procedure? Can anyone show me the right code?

Tarik
Hasek
  • You can add a column to the data frame with the calculated error. You can then filter the data frame wherever the error threshold is exceeded. Side note: with a language such as Julia, you would not have to worry about performance when iterating. – Tarik Jul 29 '21 at 05:23

1 Answer


You literally translate the formula into Python and then keep the rows that meet the condition:

mask = np.abs(a * np.log(data['X']) + b - data['Y']) < d
new_data = data[mask]
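For completeness, here is a minimal sketch of the whole fit, filter, refit procedure from the question, with no row iteration. The helper name fit_filter_refit, the synthetic data, and the threshold d=1.0 are assumptions made for illustration; with real data you would pass your own dataframe and choose d yourself.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit


def func(s, a, b):
    # Model from the question: Y = a * ln(X) + b
    return a * np.log(s) + b


def fit_filter_refit(data, d):
    # Hypothetical helper: fit, drop rows further than d from the curve, refit.
    x = data["X"].to_numpy()
    y = data["Y"].to_numpy()
    (a, b), _ = curve_fit(func, x, y)

    # Vectorised residual check over the whole column at once.
    mask = np.abs(func(x, a, b) - y) <= d
    kept = data[mask]

    (a2, b2), _ = curve_fit(func, kept["X"].to_numpy(), kept["Y"].to_numpy())
    return (a, b), (a2, b2), kept


# Synthetic data (an assumption, not the asker's file): a=2, b=1, small noise,
# plus every 20th point shifted up by 5 to act as an outlier.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 50.0, 200)
y = 2.0 * np.log(x) + 1.0 + rng.normal(0.0, 0.05, x.size)
y[::20] += 5.0
data = pd.DataFrame({"X": x, "Y": y})

first, second, kept = fit_filter_refit(data, d=1.0)
print("first fit: ", first)
print("second fit:", second)
print("rows kept: ", len(kept))
```

Note that the mask uses <= d so that only points deviating by more than d are dropped, as the question states; the strict < above differs only for points lying exactly on the threshold.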
DYZ