Theil-Sen regression with sklearn on a log-log scale

Question

I'm trying to plot my data on a log-log scale, using Theil-Sen regression for the best fit line. However, when I plot work out my regression line on a log-scale 2, it's parallel to my x=y line, which I don't think is correct.

normal scale for X and y :

log-log scale for X and y :

I found a related solution by chaooder for Linear Regression on a semi-log scale to be somewhat helpful. So currently, my regression line would go from being:

y = ax + c on a linear scale to y = 10^^(a log(x)+c) on my log-log scale. But in my head, I can't understand how that has a solution as I cannot calculate a.

Here's the data:

index,x,y
0,0.22,0.26
1,0.39,0.1
2,0.4,0.17
3,0.56,0.41
4,0.57,0.12
5,0.62,0.54
6,0.78,0.99
7,0.79,0.35
8,0.8,0.33
9,0.83,0.91
10,0.95,0.81
11,1.08,0.23
12,1.34,0.11
13,1.34,0.44
14,1.35,0.11
15,1.58,0.24
16,1.66,0.71
17,2.11,0.54
18,2.13,0.42
19,2.19,1.72
20,2.25,2.16
21,2.39,0.95
22,2.4,0.16
23,2.73,0.92
24,2.87,1.1
25,2.96,0.27
26,3.12,1.66
27,3.26,0.06
28,3.28,0.68
29,3.34,0.7
30,3.38,1.14
31,3.39,1.81
32,3.41,0.19
33,3.49,1.4
34,3.52,1.57
35,3.6,0.99
36,3.64,1.28
37,3.65,1.68
38,3.89,1.66
39,3.93,1.64
40,4.01,1.04
41,4.07,0.32
42,4.22,0.68
43,4.52,0.57
44,4.53,0.59
45,4.56,0.7
46,4.6,1.15
47,4.62,1.31
48,4.68,1.09
49,5.03,0.48
50,5.06,0.7
51,5.31,0.62
52,5.41,0.21
53,5.45,2.06
54,6.0,0.72
55,6.06,0.36
56,6.64,1.41
57,6.74,0.59
58,6.96,0.95
59,7.01,1.13
60,7.14,1.56
61,7.14,2.82
62,7.19,1.49
63,7.21,0.88
64,7.23,1.31
65,7.55,0.76
66,7.72,0.5
67,7.75,1.65
68,7.77,1.48
69,7.9,1.8
70,7.95,0.68
71,8.03,1.12
72,8.09,2.61
73,8.86,1.71
74,9.31,0.23
75,9.5,2.35
76,9.62,1.84
77,9.91,0.56
78,9.95,1.67
79,10.4,1.15
80,10.8,0.88
81,11.28,1.8
82,11.31,1.58
83,11.43,1.0
84,12.38,2.83
85,13.38,1.45
86,13.9,1.99
87,30.3,1.99

And my current code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
from sklearn.linear_model import TheilSenRegressor

for log in [True, False]:
    fig,ax = plt.subplots()
    data.plot.scatter(ax=ax,
                      x='x',
                      y='y',
                      loglog=log)

    vmin = np.amin(data[['x','y']].min().values)*0.8
    vmax = np.amax(data[['x','y']].max().values)*1.25    
  
    ax.set_xlim(vmin,vmax)
    ax.set_ylim(vmin,vmax)
    ax.yaxis.set_minor_locator(AutoMinorLocator())
    ax.xaxis.set_minor_locator(AutoMinorLocator())

    # best fit (ThielSen) line
    X = data.x.values[:,np.newaxis]
    y = data.y.values

    if log:
        X = np.log10(X)
        y = np.log10(y)
      
    if len(y) > 0:
        estimator = TheilSenRegressor(fit_intercept=False) # intentionally set intercept to 0
        estimator.fit(X=X,y=y)

        y0 = y[0]
        x0 = X[0]
    
        y_pred = estimator.predict(np.array([vmin,vmax]).reshape(2,1)) 
#       y_pred = np.power(10,(estimator.predict(X)))

        gradient = (y_pred[1] - y_pred[0]) / (vmax - vmin)
        intercept = y_pred[1] - gradient * vmax
        print(f'gradient: {gradient} \n intercept: {intercept}')

    # Theil-Sen regression line
    ax.plot([vmin,vmax],y_pred,color='red',lw=1,zorder=1,label='Best fit')
    
    # 1:1 ratio line (black, dashed)
    ax.plot([vmin,vmax],[vmin,vmax],lw=1,color='black',ls='--',alpha=0.6,zorder=1,
            label='1:1 correlation')
    if log:
        ax.set_xscale('log');ax.set_yscale('log')
        ax.set_title('log-log scale')
        fig.savefig('TS_regression_loglog.png')
    else:
        ax.set_title('normal scale')
        fig.savefig('TS_regression_normalscale.png')

score 1 · Answer 1 · answered Jan 18 '22 at 13:02

If you fitted on log-log, the input for prediction needs to be on the log scale, and you need to transform the prediction before plotting them. These are the lines in question where it's not consistent in terms of scale:

y_pred = estimator.predict(np.array([vmin,vmax]).reshape(2,1))
[..]
ax.plot([vmin,vmax],y_pred,color='red',lw=1,zorder=1,label='Best fit')

Define some of the variables in your code, note you should get the intercept and gradient from the fit:

vmin = np.amin(data[['x','y']].min().values)*0.8
vmax = np.amax(data[['x','y']].max().values)*1.25    
  
X = data.x.values[:,np.newaxis]
y = data.y.values

With a slight modification to your code:

vmin = np.amin(data[['x','y']].min().values)*0.8
vmax = np.amax(data[['x','y']].max().values)*1.25    
  
X = data.x.values[:,np.newaxis]
y = data.y.values

fig,ax = plt.subplots()
data.plot.scatter(ax=ax,x='x',y='y')

estimator = TheilSenRegressor(fit_intercept=False) # intentionally set intercept to 0
estimator.fit(X=np.log10(X),y=np.log10(y))

gradient = estimator.coef_[0]
intercept = estimator.intercept_
print([gradient,intercept])

y_pred = estimator.predict(np.log10([vmin,vmax]).reshape(2,1)) 
ax.plot([vmin,vmax],10**(y_pred),color='red',lw=1,zorder=1,label='Best fit')

ax.plot([vmin,vmax],[vmin,vmax],lw=1,color='black',ls='--',alpha=0.6,zorder=1,
            label='1:1 correlation')

ax.set_xscale('log')
ax.set_yscale('log')

Theil-Sen regression with sklearn on a log-log scale

1 Answers1