Producing a "best fit" slope gradient from pandas df and populating new column

Question

I'm trying to calculate a slope for individual subsets of two fields in a dataframe and have that slope applied to every row in each subset. (I've used the "slope" function in Excel previously, although I'm not married to the exact algorithm.) The "desired_output" field is what I'm expecting as the output. The subsets are distinguished by the "strike_order" column: each subset starts at 1 and has no fixed highest value.

"IV" is the y value and "strike" is the x value.

Any help would be appreciated as I don't even know where to begin with this....

import pandas
df = pandas.DataFrame([[1200,1,.4,0.005],[1210,2,.35,0.005],[1220,3,.3,0.005],
[1230,4,.25,0.005],[1200,1,.4,0.003],[1210,2,.37,.003]],columns=
["strike","strike_order","IV","desired_output"])
df

    strike  strike_order    IV  desired_output
0   1200        1         0.40    0.005
1   1210        2         0.35    0.005
2   1220        3         0.30    0.005
3   1230        4         0.25    0.005
4   1200        1         0.40    0.003
5   1210        2         0.37    0.003

Let me know if this isn't a well posed question and I'll try to make it better.

sgDysregulation · Answer 1 · 2017-10-15T09:25:34.843

You can use numpy's least-squares solver. We can rewrite the line equation y = mx + c as y = Ap, where A = [[x 1]] and p = [[m], [c]]. Then use lstsq to solve for p; to build A, add a column of ones to df.

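import numpy as np
# column of ones so the fit includes an intercept term c
df['ones']=1
A = df[['strike','ones']]
y = df['IV']
# note: this fits one line through all rows of df, not one per subset
# rcond=None selects numpy's current default and avoids a FutureWarning
m, c = np.linalg.lstsq(A, y, rcond=None)[0]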
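For a single x column the least-squares slope also has a closed form, m = cov(x, y) / var(x), which is the slope of the regression line that Excel's SLOPE returns. Below is a minimal pure-Python sketch of it, checked against the two subsets in the question; ls_slope is a hypothetical helper name, not a library function.

```python
def ls_slope(xs, ys):
    """Least-squares slope of ys against xs (hypothetical helper, not a library call)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# first subset from the question (strike_order 1 to 4)
slope_first = ls_slope([1200, 1210, 1220, 1230], [0.40, 0.35, 0.30, 0.25])
# second subset from the question (strike_order 1 to 2)
slope_second = ls_slope([1200, 1210], [0.40, 0.37])
```

The two values come out at roughly -0.005 and -0.003: the magnitudes in desired_output, with a negative sign because IV falls as strike rises.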

Alternatively, you can use one of the regression models in scikit-learn's linear_model module.

You can verify the result by plotting the data as a scatter plot and the fitted line as a line plot.

import matplotlib.pyplot as plt
plt.scatter(df['strike'],df['IV'],color='r',marker='d')
x = df['strike']
#plug x in the equation y=mx+c
y_line = c + m * x
#plot the fitted line, not the raw y values
plt.plot(x,y_line)
plt.xlabel('Strike')
plt.ylabel('IV')
plt.show()

the resulting plot is shown below

Great thanks for this, it gets me part of the way there. – Benson Burns, Oct 16 '17 at 07:31

Scott Simpson · Answer 2 · 2017-10-20T04:59:11.910

Try this.

First, create a subset column by iterating over the dataframe, treating each row where strike_order returns to 1 as the boundary between subsets.

#create subset column
subset_counter = 0
for index, row in df.iterrows():
    if row["strike_order"] == 1:
        df.loc[index,'subset'] = subset_counter
        subset_counter += 1
    else:
        df.loc[index,'subset'] = df.loc[index-1,'subset']

df['subset'] = df['subset'].astype(int)
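If the only rule is that a new subset starts wherever strike_order equals 1, the loop can probably be replaced by a cumulative sum over that condition. A sketch on the question's sample data, assuming the first row always has strike_order 1:

```python
import pandas as pd

df = pd.DataFrame(
    [[1200, 1, 0.40], [1210, 2, 0.35], [1220, 3, 0.30],
     [1230, 4, 0.25], [1200, 1, 0.40], [1210, 2, 0.37]],
    columns=["strike", "strike_order", "IV"],
)

# True wherever a new subset begins; cumsum turns that into 1, 1, 1, 1, 2, 2
# and subtracting 1 makes the labels start at 0
df["subset"] = (df["strike_order"] == 1).cumsum() - 1
```

This never reads the previous row, so there is no row -1 to guard against.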

Then run a linear regression over each subset using groupby

# run linear regression on subsets of the dataframe using groupby
from sklearn import linear_model
model = linear_model.LinearRegression()
for (group, df_gp) in df.groupby('subset'):
    X=df_gp[['strike']]
    y=df_gp.IV
    model.fit(X,y)
    # coef_ is a length-1 array here; take the scalar slope
    df.loc[df.subset == group, 'slope'] = model.coef_[0]

df

   strike  strike_order    IV  desired_output  subset  slope
0    1200             1  0.40           0.005       0 -0.005
1    1210             2  0.35           0.005       0 -0.005
2    1220             3  0.30           0.005       0 -0.005
3    1230             4  0.25           0.005       0 -0.005
4    1200             1  0.40           0.003       1 -0.003
5    1210             2  0.37           0.003       1 -0.003
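If you would rather not pull in scikit-learn for this, the per-subset slope can also be computed with np.polyfit inside a groupby. This is only a sketch: the subset column is rebuilt here with a cumulative sum so the snippet runs on its own, and fit_slope is a made-up helper name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1200, 1, 0.40], [1210, 2, 0.35], [1220, 3, 0.30],
     [1230, 4, 0.25], [1200, 1, 0.40], [1210, 2, 0.37]],
    columns=["strike", "strike_order", "IV"],
)
# assumption: a new subset starts wherever strike_order is 1
df["subset"] = (df["strike_order"] == 1).cumsum() - 1

def fit_slope(group):
    # a degree-1 polyfit returns [slope, intercept]; keep the slope
    return np.polyfit(group["strike"], group["IV"], 1)[0]

# one slope per subset, indexed by subset label
slopes = df.groupby("subset")[["strike", "IV"]].apply(fit_slope)
# broadcast each subset's slope back onto its rows
df["slope"] = df["subset"].map(slopes)
```

Mapping the per-subset result back through the subset column fills every row of a subset with the same slope, which is the shape of output the question asks for.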

You're wasted in coal mining. – Benson Burns, Oct 20 '17 at 03:17

score 0 · Answer 3 · answered Oct 20 '17 at 09:35

@Scott This worked, except the subset values went 0, 1 and then every subsequent subset was 2. I added an extra conditional at the beginning and a very clumsy "seed" value to stop it looking for row -1.

    import scipy
    # use the first row's date_exp as a "seed" marking the first subset
    seed = df.loc[0, "date_exp"]
    subset_counter = 0
    for index, row in df.iterrows():
        if row['date_exp'] == seed:
            df.loc[index, 'subset'] = 0
        elif row["strike_order"] == 1:
            # new subset: one more than the previous row's subset
            subset_counter = 1 + df.loc[index-1, 'subset']
            df.loc[index, 'subset'] = subset_counter
        else:
            df.loc[index, 'subset'] = df.loc[index-1, 'subset']

    df['subset'] = df['subset'].astype(int)

This now does exactly what I want, although I think using the seed value is clunky; I would have preferred to test for the first row directly (if row == 0, etc.). But it's Friday and this works.

Cheers
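One way to avoid the seed entirely might be to keep the running subset number in a plain variable rather than reading it back from the previous row. A sketch, with subset_ids as a hypothetical helper that takes the strike_order values in row order:

```python
def subset_ids(strike_orders):
    """Label each row with a subset number (hypothetical helper)."""
    ids = []
    current = -1  # no subset started yet
    for order in strike_orders:
        # start a new subset on strike_order 1, or on the very first row
        if order == 1 or current < 0:
            current += 1
        ids.append(current)
    return ids

example = subset_ids([1, 2, 3, 4, 1, 2, 1, 2, 3])
```

With a dataframe like the one in the question this would be used as df['subset'] = subset_ids(df['strike_order']).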
