
I am fitting a linear regression model with scikit-learn. On its own, training runs without errors. However, if I plot a histogram with matplotlib before training, the call to fit raises an error. Any matplotlib histogram triggers this, even one built from unrelated data, so the cause is not the histogram code modifying the dataset.

Here are my library version numbers:

numpy 1.19.0
scikit-learn 0.23.0
matplotlib 3.1.0
pandas 1.0.5 
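
For anyone trying to reproduce this, the sketch below is one way to collect the same version numbers together with the OS build, which may matter if the problem turns out to be platform-specific. The helper name `collect_environment` is made up for illustration, and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
import platform
from importlib import metadata

# Distribution names of the packages listed above.
PACKAGES = ("numpy", "scikit-learn", "matplotlib", "pandas")


def collect_environment(packages=PACKAGES):
    """Return a dict with the Python/OS details and installed package versions."""
    info = {
        "python": platform.python_version(),
        "os": platform.system(),
        "os_version": platform.version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```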

Minimal reproducible example (adapted from an official sklearn example):

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

plt.hist([1, 2, 3])
plt.show()

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)

Resulting Error:

Traceback (most recent call last):
File "test2.py", line 15, in <module> regr.fit(diabetes_X_train, diabetes_y_train)
File "base.py", line 547, in fit linalg.lstsq(X, y)
File "basic.py", line 1226, in lstsq
% (-info, lapack_driver))
ValueError: illegal value in 4-th argument of internal None

Minimal example from my code:

from sklearn import datasets
import sklearn.model_selection as ms
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression


plt.hist([1,2,3,4,5,6,7,8,9])
plt.show()

dataset_boston = datasets.load_boston(return_X_y=False)

indices = list(range(0, len(dataset_boston.data)))
dataframe = pd.DataFrame(data=dataset_boston.data, index=indices, columns=dataset_boston.feature_names)
targets = pd.DataFrame(data=dataset_boston.target,  index=indices, columns=['target'])

dataframe['target'] = targets
dataframe = dataframe.replace([np.inf, -np.inf], np.nan)
dataframe = dataframe.dropna()
train_set_ex4, test_set_ex4 = ms.train_test_split(dataframe, test_size=0.2, random_state=42, shuffle=True)
train_examples = train_set_ex4.loc[:, train_set_ex4.columns != 'target']
train_targets = train_set_ex4['target']

reg = LinearRegression()
reg.fit(train_examples, train_targets)

Resulting Error:

Traceback (most recent call last):
File "test.py", line 25, in <module> reg.fit(train_examples, train_targets)
File "base.py", line 547, in fit linalg.lstsq(X, y)
File "basic.py", line 1223, in lstsq
raise LinAlgError("SVD did not converge in Linear Least Squares")
numpy.linalg.LinAlgError: SVD did not converge in Linear Least Squares

Both examples run without any error when the histogram plot is removed. Am I doing something wrong, or is there perhaps a bug in one of the libraries used?
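
To check that the plot alone makes the difference, one option is to run both variants in fresh interpreter processes so that no state carries over between them. This is only a sketch: `run_in_fresh_interpreter` and `compare` are hypothetical helper names, the snippets are condensed from the first example above, and `plt.show()` will block until the plot window is closed.

```python
import subprocess
import sys


def run_in_fresh_interpreter(source):
    """Run `source` in a new Python process; return (exit code, last stderr line)."""
    proc = subprocess.run(
        [sys.executable, "-c", source], capture_output=True, text=True
    )
    stderr_lines = proc.stderr.strip().splitlines()
    return proc.returncode, (stderr_lines[-1] if stderr_lines else "")


# Condensed from the first example in this question.
PLOT = (
    "import matplotlib.pyplot as plt\n"
    "plt.hist([1, 2, 3])\n"
    "plt.show()\n"
)
FIT = (
    "import numpy as np\n"
    "from sklearn import datasets, linear_model\n"
    "X, y = datasets.load_diabetes(return_X_y=True)\n"
    "linear_model.LinearRegression().fit(X[:, np.newaxis, 2], y)\n"
)


def compare():
    """Print the outcome of fitting without and with a preceding plot."""
    for label, source in (("fit only", FIT), ("plot, then fit", PLOT + FIT)):
        code, last_line = run_in_fresh_interpreter(source)
        print(f"{label}: exit code {code} {last_line}")
```

Calling `compare()` prints one line per variant. A non-zero exit code only for "plot, then fit" would be consistent with the behaviour described above.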

  • This may be a wild guess, but if you are running a recently updated Windows 10, you may be affected, like others, by a [known bug](https://github.com/numpy/numpy/issues/16744); see also the potentially relevant discussions [here](https://stackoverflow.com/questions/64654805/how-do-you-fix-runtimeerror-package-fails-to-pass-a-sanity-check-for-numpy-an) and [here](https://developercommunity.visualstudio.com/content/problem/1207405/fmod-after-an-update-to-windows-2004-is-causing-a.html) – Asmus Nov 23 '20 at 13:35
