1

I want to see what effect multi-collinearity has on a linear regression model but I need to be able to generate multi collinear data where I can vary the number of features and the collinearity between these features.

I've had a look at Sklearn's make_regression function and it allows for the generation of multiple features but from what I understand these features are all uncorrelated correct?

If so, does anyone know how I could vary the correlation between these features or use a different method to generate a linearly multi-collinear dataset to train Sklearn's linear regression model with?

Flavia Giammarino
  • 7,987
  • 11
  • 30
  • 40
Dwo
  • 79
  • 7

1 Answers1

3

You could simulate the features from the multivariate normal distribution as follows:

import numpy as np
from sklearn.linear_model import LinearRegression

def make_regression(n_samples, n_uncorrelated, n_correlated, correlation, weights, bias, noise=1, seed=42):

    np.random.seed(seed)

    X_correlated = np.random.multivariate_normal(
        mean=np.zeros(n_correlated),
        cov=correlation * np.ones((n_correlated, n_correlated)) + (1 - correlation) * np.eye(n_correlated),
        size=n_samples
    )

    X_uncorrelated = np.random.multivariate_normal(
        mean=np.zeros(n_uncorrelated),
        cov=np.eye(n_uncorrelated),
        size=n_samples
    )

    X = np.hstack([X_correlated, X_uncorrelated])
    e = np.random.normal(loc=0, scale=noise, size=n_samples)
    y = bias + np.dot(X, weights) + e

    return X, y

X, y = make_regression(
    n_samples=1000,
    n_uncorrelated=1,
    n_correlated=3,
    correlation=0.999,
    weights=[0.5, 0.5, 0.5, 0.5],
    bias=0,
)

print(np.round(np.corrcoef(X, rowvar=False), 1))
# [[ 1.  1.  1. -0.]
#  [ 1.  1.  1. -0.]
#  [ 1.  1.  1. -0.]
#  [-0. -0. -0.  1.]]

reg = LinearRegression()
reg.fit(X, y)

print(reg.intercept_)
# -0.0503434375710194

print(reg.coef_)
# [0.62245063 -0.43110213  1.31516103  0.52019845]
Flavia Giammarino
  • 7,987
  • 11
  • 30
  • 40
  • Hey, this is exactly what I was looking for. Thanks! One question though, the correlation parameter sets the same level of correlation for all features am I correct? How would it work if I wanted to control the degree of correlation between specific features? – Dwo Dec 20 '21 at 20:30
  • 1
    Yes that's correct, for simplicity the correlation coefficient between each pair of correlated features is assumed to be the same. If you want to use different correlation coefficients you need to define the covariance matrix manually. – Flavia Giammarino Dec 20 '21 at 20:40