19

If I try to run the script below I get the error: LinAlgError: SVD did not converge in Linear Least Squares. I have used the exact same script on a similar dataset and there it works. I have tried to search for values in my dataset that Python might interpret as a NaN but I cannot find anything.

My dataset is quite large and impossible to check by hand. (But I think my dataset is fine). I also checked the length of stageheight_masked and discharge_masked but they are the same. Does anyone know why there is an error in my script and what can I do about it?

import numpy as np
import datetime
import matplotlib.dates
import matplotlib.pyplot as plt
from scipy import polyfit, polyval

kwargs = dict(delimiter = '\t',\
     skip_header = 0,\
     missing_values = 'NaN',\
     converters = {0:matplotlib.dates.strpdate2num('%d-%m-%Y %H:%M')},\
     dtype = float,\
     names = True,\
     )

rating_curve_Gillisstraat = np.genfromtxt('G:\Discharge_and_stageheight_Gillisstraat.txt',**kwargs)

discharge = rating_curve_Gillisstraat['discharge']   #change names of columns
stageheight = rating_curve_Gillisstraat['stage'] - 131.258

#mask NaN
discharge_masked = np.ma.masked_array(discharge,mask=np.isnan(discharge)).compressed()
stageheight_masked = np.ma.masked_array(stageheight,mask=np.isnan(discharge)).compressed()

#sort
sort_ind = np.argsort(stageheight_masked)
stageheight_masked = stageheight_masked[sort_ind]
discharge_masked = discharge_masked[sort_ind]

#regression
a1,b1,c1 = polyfit(stageheight_masked, discharge_masked, 2)
discharge_predicted = polyval([a1,b1,c1],stageheight_masked)

print 'regression coefficients'
print (a1,b1,c1)

#create upper and lower uncertainty
upper = discharge_predicted*1.15
lower = discharge_predicted*0.85

#create scatterplot

plt.scatter(stageheight,discharge,color='b',label='Rating curve')
plt.plot(stageheight_masked,discharge_predicted,'r-',label='regression line')
plt.plot(stageheight_masked,upper,'r--',label='15% error')
plt.plot(stageheight_masked,lower,'r--')
plt.axhline(y=1.6,xmin=0,xmax=1,color='black',label='measuring range')
plt.title('Rating curve Catsop')
plt.ylabel('discharge')
plt.ylim(0,2)
plt.xlabel('stageheight[m]')
plt.legend(loc='upper left', title='Legend')
plt.grid(True)
plt.show()
marc_s
Toine Kerckhoffs
  • 1
    I'm pretty sure that `polyfit` doesn't support masked arrays, so it will treat NaNs like any other value. You also need to check for infinite values (e.g. using `np.isinf`). – ali_m Feb 23 '16 at 22:43
  • Another reason might be that you have a "vertical line" in your data! – Yahya Mar 30 '23 at 17:40

5 Answers

27

I don't have your data file, but when you get that error it is almost always the case that you have NaNs or infinities in your data. Look for both using pd.notnull or np.isfinite (note that pd.notnull only flags missing values, while np.isfinite also rejects infinities).
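A minimal sketch of that check, assuming the inputs are two plain NumPy arrays of equal length. The helper name `fit_finite` and the sample values are made up for illustration. The point is that the mask is built from both arrays, so a NaN or infinity in either one removes the whole pair before the fit:

```python
import numpy as np

def fit_finite(x, y, degree=2):
    """Fit a polynomial using only the pairs where both x and y are finite.

    Hypothetical helper for illustration; not part of NumPy or SciPy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A pair is kept only if neither value is NaN or +/-inf.
    keep = np.isfinite(x) & np.isfinite(y)
    n_dropped = int((~keep).sum())
    coeffs = np.polyfit(x[keep], y[keep], degree)
    return coeffs, n_dropped

# Made-up data following y = x**2 + 1, with one NaN in x and one inf in y.
stage = np.array([0.0, 1.0, 2.0, 3.0, np.nan, 4.0, 5.0])
flow = np.array([1.0, 2.0, 5.0, 10.0, 17.0, np.inf, 26.0])

coeffs, n_dropped = fit_finite(stage, flow)
print(n_dropped, coeffs)
```

Reporting how many pairs were dropped makes it easy to see whether a large dataset actually contains non-finite values without inspecting it by hand.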

ski_squaw
2

As others have pointed out, the problem is likely that some rows contain no numeric values for the algorithm to work with. This is an issue with most regressions.

That's the problem. The solution is to do something about those rows, and what to do depends on the data. Often you can replace the NaNs with 0s, for example with Pandas .fillna(0). Sometimes you have to interpolate the missing values, and Pandas .interpolate() is probably the simplest way to do that. When the data is not a time series, you may be able to simply drop the rows that contain NaNs, for example with Pandas .dropna(). And sometimes the issue is not NaNs but infs or other values, for which there are other solutions: https://stackoverflow.com/a/55293137/12213843

Which approach to take depends on the data, and it is up to you to interpret that data. Domain knowledge goes a long way towards interpreting it well.
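The options above can be compared side by side on a toy Series. This is only a sketch with made-up values. Infinities are first converted to NaN so that all three methods treat them as missing:

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value and one infinity.
s = pd.Series([1.0, np.nan, 3.0, np.inf, 5.0])

# Treat +/-inf as missing so the NaN-handling methods below catch them too.
s = s.replace([np.inf, -np.inf], np.nan)

filled = s.fillna(0)            # replace missing values with 0
interpolated = s.interpolate()  # linear interpolation between neighbours
dropped = s.dropna()            # remove missing values entirely

print(filled.tolist())
print(interpolated.tolist())
print(dropped.tolist())
```

Note that filling with 0 changes the fitted curve, while dropping only reduces the number of points, so the choice is not neutral.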

Robin
1

As ski_squaw mentions, the error is most often due to NaNs. For me, however, it appeared after a Windows update. I was using NumPy 1.16, and switching to 1.19.3 solved the issue (run pip install numpy==1.19.3 --user in cmd).

This GitHub issue explains it in more detail: https://github.com/numpy/numpy/issues/16744

NumPy 1.19.3 doesn't work on Linux and 1.19.4 doesn't work on Windows.
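One rough way to tell an installation problem from a data problem is to run a fit on tiny, clean, made-up data. This is only a sketch: if even this raises LinAlgError, the installed NumPy build is the more likely culprit than the dataset.

```python
import numpy as np

# Show which NumPy version is actually being imported.
print(np.__version__)

# Clean synthetic data on the line y = 2x + 1; no NaN or inf anywhere.
x = np.arange(10.0)
y = 2.0 * x + 1.0

# On a healthy installation this should recover slope 2 and intercept 1.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```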

Joris
0

I developed my code on Windows 8. After moving to Windows 10, the problem appeared. It was resolved as @Joris described:

pip install numpy==1.19.3

Leonardo
  • 2
    While this is a valid answer to the question, at least in your use case, it doesn't add new information that was not already in @Joris's answer. It is best not to post duplicate answers like this. – joanis Sep 18 '21 at 22:12
0

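My example after the fix:

import numpy as np

def calculating_slope(x):
    # x is assumed to be a pandas Series; turn +/-inf into NaN, then drop all NaNs
    x = x.replace(np.inf, np.nan).replace(-np.inf, np.nan).dropna()
    if len(x) > 1:
        # slope of a straight-line fit against the positions 0..n-1
        slope = np.polyfit(range(len(x)), x, 1)[0]
    else:
        # fewer than two valid points: no line can be fitted
        slope = 0
    return slope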