3

A neural network with no hidden layers and sigmoid/softmax activation is just logistic regression:

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
nn = MLPClassifier(hidden_layer_sizes=(), solver = 'lbfgs', activation='logistic', alpha = 0).fit(X,y)
l = LogisticRegression(penalty='none', solver = 'lbfgs',  fit_intercept = False).fit(X,y)

so why don't these two models produce the same coefficients? Most of them are close but there are a few discrepancies:

print("NN")
print(nn.coefs_[0].T)
print("\nLogistic")
print(l.coef_)
NN
[[  5.40104629  11.39328515 -16.50698752  -7.86329804]
 [ -1.06741383  -2.48638863   3.37921506  -5.29842503]
 [ -3.55724865  -9.11027371  12.79749019  12.9357708 ]]

Logistic
[[  5.10297361  11.87381176 -16.50600209  -7.70449685]
 [  0.61357365  -2.6277241    4.03442742  -1.28869255]
 [ -5.71654726  -9.24608766  12.47157468   8.9931894 ]]
Flavia Giammarino
  • 7,987
  • 11
  • 30
  • 40
invictus
  • 1,821
  • 3
  • 25
  • 30

2 Answers2

2

There are some concerns with your comparison, but rectifying them does not resolve the issue; so, this is only a partial answer.

First, the MLP classifier includes a bias (intercept) node by default (the presence of which, unlike with LR, is not customizable), so you need fit_intercept = True in LR.

Second, despite the same solvers in both models, the max_iter default values are different, so we should set them to be equal.

Third, and for keeping the issue as simple as possible, it would arguably be a good idea to keep the discussion in a binary classification setting, instead of a multi-class one.

Here is your code modified as per the above:

from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)

X, y = shuffle(X[:100,], y[:100], random_state=42) # keep only classes 0/1 (binary problem)

nn = MLPClassifier(hidden_layer_sizes=(), solver = 'lbfgs', activation='logistic', alpha = 0, max_iter=100).fit(X,y)
l = LogisticRegression(penalty='none', solver = 'lbfgs',  fit_intercept = True).fit(X,y)

print("NN coefficients & intercept")
print(nn.coefs_[0].T)
print(nn.intercepts_)
print("\nLR coefficients & intercept")
print(l.coef_)
print(l.intercept_)

Results:

NN coefficients & intercept
[[-1.34230329 -4.29615611  7.14868389  2.66752688]]
[array([-0.90035086])]

LR coefficients & intercept
[[-2.07247339 -6.90694692 10.97006745  5.64543091]]
[-1.05932537]

Thing is, if you run the above code several times (I have not set any random state, save for the one for data shuffling), you'll see that, while the LR results are each time the same, the MLP results differ themselves from run to run. Here is another short experiment demonstrating and quantifying this:

nn_coef = []
nn_intercept = []
lr_coef = []
lr_inter = []

for i in range(0,20):
  nn = MLPClassifier(hidden_layer_sizes=(), solver = 'lbfgs', activation='logistic', alpha = 0, max_iter=100).fit(X,y)
  l = LogisticRegression(penalty='none', solver = 'lbfgs',  fit_intercept = True).fit(X,y)

  nn_coef.append(nn.coefs_[0].T)
  nn_intercept.append(nn.intercepts_)
  lr_coef.append(l.coef_)
  lr_inter.append(l.intercept_)

import numpy as np

# get the standard deviations of coefficients & intercepts between runs:

print(np.std(nn_coef, axis=0))
print(np.std(lr_coef, axis=0))
print()
print(np.std(nn_intercept))
print(np.std(lr_inter))

Results:

[[0.14334883 0.42125216 0.46115555 0.4488226 ]]
[[0.00000000e+00 8.88178420e-16 1.77635684e-15 8.88178420e-16]]

0.3393994986547498
0.0

So, it is apparent that, while the standard deviations of the LR coefficients & intercept are practically zero, the respective standard deviations of the MLP parameters are quite large indeed.

It would seem that the MLP algorithm, at least with the L-BFGS solver, is very sensitive to the initialization of the weights & biases, which is not the case with the LR. This seems to be the implicit assumption in a relevant Github thread, too. But I agree with your implicit expectation that it should not be so.

If no one else comes up with a satisfactory answer, I guess it is a good candidate case for opening a Github issue.

desertnaut
  • 57,590
  • 26
  • 140
  • 166
2

As pointed out by @desertnaut the MLP initialization seems indeed to be the issue as the difference between the MLP and LR coefficients appear to decrease as the sample size increases.

from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

random_state = 100
n_samples = 1000

X, y = make_classification(n_samples=n_samples, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=random_state)
X = StandardScaler().fit_transform(X)

nn = MLPClassifier(hidden_layer_sizes=(), solver='lbfgs', activation='logistic', alpha=0, max_iter=1000, tol=0, random_state=random_state).fit(X,y)
lr = LogisticRegression(penalty='none', solver='lbfgs', fit_intercept=True, max_iter=1000, tol=0, random_state=random_state).fit(X,y)

print(nn.intercepts_[0])
print(lr.intercept_)
# [-1.08397244]
# [-1.08397505]

print(nn.coefs_[0].T)
print(lr.coef_)
# [[ 2.90716947 -3.08525711]]
# [[ 2.90718263 -3.08525826]]

The code below shows that as the sample size increases the variance of the MLP coefficients decreases and both the MLP coefficients and the LR coefficients converge to the true coefficients, even though the exact cut-off point depends on the dataset.

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# sample sizes
n_samples = [25, 50, 75, 100, 250, 500, 750, 1000, 5000, 10000]

# number of refits of the MLP and LR
# models for each sample size
n_repetitions = 100

# synthetic data
true_intercept = 10
true_weights = [20, 30]
X = np.random.multivariate_normal(np.zeros(2), np.eye(2), np.max(n_samples))
Z = true_intercept + np.dot(X, true_weights) + np.random.normal(0, 1, np.max(n_samples))
p = 1 / (1 + np.exp(- Z))
y = np.random.binomial(1, p, np.max(n_samples))

# data frame for storing the results for each sample size
output = pd.DataFrame(columns=['sample size', 'label avg.', 'LR intercept avg.', 'LR intercept std.', 'NN intercept avg.',
'NN intercept std.', 'LR first weight avg.', 'LR first weight std.', 'NN first weight avg.', 'NN first weight std.',
'LR second weight avg.', 'LR second weight std.', 'NN second weight avg.', 'NN second weight std.'])

# loop across the different
# sample sizes "n"
for n in n_samples:

    lr_intercept, lr_coef = [], []
    nn_intercept, nn_coef = [], []

    # refit the MLP and LR models multiple times
    # using the first "n" samples
    for k in range(n_repetitions):

        nn = MLPClassifier(hidden_layer_sizes=(), solver='lbfgs', activation='logistic', alpha=0, max_iter=1000, tol=0)
        lr = LogisticRegression(penalty='none', solver='lbfgs', fit_intercept=True, max_iter=1000, tol=0)

        nn.fit(X[:n, :], y[:n])
        lr.fit(X[:n, :], y[:n])

        lr_intercept.append(lr.intercept_)
        nn_intercept.append(nn.intercepts_[0])

        lr_coef.append(lr.coef_)
        nn_coef.append(nn.coefs_[0].T)

    # save the sample mean and sample standard deviations
    # of the MLP and LR estimated coefficients for the
    # considered sample size "n"
    output = output.append(pd.DataFrame({
        'sample size': [n],
        'label avg.': [np.mean(y[:n])],
        'LR intercept avg.': [np.mean(lr_intercept)],
        'LR intercept std.': [np.std(lr_intercept, ddof=1)],
        'NN intercept avg.': [np.mean(nn_intercept)],
        'NN intercept std.': [np.std(nn_intercept, ddof=1)],
        'LR first weight avg.': [np.mean(lr_coef, axis=0)[0][0]],
        'LR first weight std.': [np.std(lr_coef, ddof=1, axis=0)[0][0]],
        'NN first weight avg.': [np.mean(nn_coef, axis=0)[0][0]],
        'NN first weight std.': [np.std(nn_coef, ddof=1, axis=0)[0][0]],
        'LR second weight avg.': [np.mean(lr_coef, axis=0)[0][1]],
        'LR second weight std.': [np.std(lr_coef, ddof=1, axis=0)[0][1]],
        'NN second weight avg.': [np.mean(nn_coef, axis=0)[0][1]],
        'NN second weight std.': [np.std(nn_coef, ddof=1, axis=0)[0][1]],
    }), ignore_index=True)

# plot the results
fig = make_subplots(rows=3, cols=1, subplot_titles=['Intercept', 'First Weight', 'Second Weight'])

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=[true_intercept] * output.shape[0],
    mode='lines',
    line=dict(color='rgb(82, 188, 163)', dash='dot', width=1),
    legendgroup='True Value',
    name='True Value',
    showlegend=True,
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR intercept avg.'] + output['LR intercept std.'],
    mode='lines',
    line=dict(color='rgba(229, 134, 6, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR intercept avg.'] - output['LR intercept std.'],
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(229, 134, 6, 0.2)',
    line=dict(color='rgba(229, 134, 6, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR intercept avg.'],
    mode='lines',
    line=dict(color='rgb(229, 134, 6)', dash='dot', width=1),
    legendgroup='Logistic Regression',
    name='Logistic Regression',
    showlegend=True,
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN intercept avg.'] + output['NN intercept std.'],
    mode='lines',
    line=dict(color='rgba(93, 105, 177, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN intercept avg.'] - output['NN intercept std.'],
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(93, 105, 177, 0.2)',
    line=dict(color='rgba(93, 105, 177, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN intercept avg.'],
    mode='lines',
    line=dict(color='rgb(93, 105, 177)', dash='dot', width=1),
    legendgroup='MLP Regression',
    name='MLP Regression',
    showlegend=True,
), row=1, col=1)

fig.update_xaxes(
    title='Sample Size',
    type='category',
    mirror=True,
    linecolor='#d9d9d9',
    showgrid=False,
    zeroline=False,
    row=1, col=1
)

fig.update_yaxes(
    title='Estimate',
    mirror=True,
    linecolor='#d9d9d9',
    showgrid=False,
    zeroline=False,
    row=1, col=1
)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=[true_weights[0]] * output.shape[0],
    mode='lines',
    line=dict(color='rgb(82, 188, 163)', dash='dot', width=1),
    legendgroup='True Value',
    showlegend=False,
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR first weight avg.'] + output['LR first weight std.'],
    mode='lines',
    line=dict(color='rgba(229, 134, 6, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR first weight avg.'] - output['LR first weight std.'],
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(229, 134, 6, 0.2)',
    line=dict(color='rgba(229, 134, 6, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR first weight avg.'],
    mode='lines',
    line=dict(color='rgb(229, 134, 6)', dash='dot', width=1),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN first weight avg.'] + output['NN first weight std.'],
    mode='lines',
    line=dict(color='rgba(93, 105, 177, 0.2)'),
    legendgroup='MLP Regression',
    showlegend=False,
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN first weight avg.'] - output['NN first weight std.'],
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(93, 105, 177, 0.2)',
    line=dict(color='rgba(93, 105, 177, 0.2)'),
    legendgroup='MLP Regression',
    showlegend=False,
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN first weight avg.'],
    mode='lines',
    line=dict(color='rgb(93, 105, 177)', dash='dot', width=1),
    legendgroup='MLP Regression',
    showlegend=False,
), row=2, col=1)

fig.update_xaxes(
    title='Sample Size',
    type='category',
    mirror=True,
    linecolor='#d9d9d9',
    showgrid=False,
    zeroline=False,
    row=2, col=1
)

fig.update_yaxes(
    title='Estimate',
    mirror=True,
    linecolor='#d9d9d9',
    showgrid=False,
    zeroline=False,
    row=2, col=1
)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=[true_weights[1]] * output.shape[0],
    mode='lines',
    line=dict(color='rgb(82, 188, 163)', dash='dot', width=1),
    legendgroup='True Value',
    showlegend=False,
), row=3, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR second weight avg.'] + output['LR second weight std.'],
    mode='lines',
    line=dict(color='rgba(229, 134, 6, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=3, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR second weight avg.'] - output['LR second weight std.'],
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(229, 134, 6, 0.2)',
    line=dict(color='rgba(229, 134, 6, 0.2)'),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=3, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['LR second weight avg.'],
    mode='lines',
    line=dict(color='rgb(229, 134, 6)', dash='dot', width=1),
    legendgroup='Logistic Regression',
    showlegend=False,
), row=3, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN second weight avg.'] + output['NN second weight std.'],
    mode='lines',
    line=dict(color='rgba(93, 105, 177, 0.2)'),
    legendgroup='MLP Regression',
    showlegend=False,
), row=3, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN second weight avg.'] - output['NN second weight std.'],
    mode='lines',
    fill='tonexty',
    fillcolor='rgba(93, 105, 177, 0.2)',
    line=dict(color='rgba(93, 105, 177, 0.2)'),
    legendgroup='MLP Regression',
    showlegend=False,
), row=3, col=1)

fig.add_trace(go.Scatter(
    x=output['sample size'],
    y=output['NN second weight avg.'],
    mode='lines',
    line=dict(color='rgb(93, 105, 177)', dash='dot', width=1),
    legendgroup='MLP Regression',
    showlegend=False,
), row=3, col=1)

fig.update_xaxes(
    title='Sample Size',
    type='category',
    mirror=True,
    linecolor='#d9d9d9',
    showgrid=False,
    zeroline=False,
    row=3, col=1
)

fig.update_yaxes(
    title='Estimate',
    mirror=True,
    linecolor='#d9d9d9',
    showgrid=False,
    zeroline=False,
    row=3, col=1
)

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='white',
    legend=dict(x=0, y=1.125, orientation='h'),
    font=dict(family='Arial', size=6),
    margin=dict(t=40, l=20, r=20, b=20)
)

fig.update_annotations(
    font=dict(family='Arial', size=8)
)

# fig.write_image('LR_MLP_comparison.png', engine='orca', scale=4, height=500, width=400)
fig.write_image('LR_MLP_comparison.png', engine='kaleido', scale=4, height=500, width=400)

enter image description here

Flavia Giammarino
  • 7,987
  • 11
  • 30
  • 40
  • Nice catch; did you try to run multiple experiments w/o fixing the random state, to see how stable the MLP results are? – desertnaut Jun 26 '21 at 13:24
  • 1
    I have just updated my answer but there are some exceptions, in the sense that there are some datasets for which the MLP and LR coefficients are nearly identical even with a small sample size of 25 - 50 data points. – Flavia Giammarino Jun 26 '21 at 17:11