Pythonic linear regression

Question

I've drawn up a simple Linear Regression piece of code.

Would there be a more beautiful or pythonic way of writing this up?

python
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from scipy import stats
from random import randint
import numpy as np

def regress(y, x):

    reg = slope,intercept,r_value,p_value,std_err = stats.linregress(x,y)   ## generate regression elements
    yhat = x*reg.slope + intercept                                          ## predict y using with slope(coefficient) and intercept


if __name__=="__main__":
    x= np.array([randint(0,1000) for n in range(0,100)])            ## generate 100 random integers between 1 and 1000 for x
    y= np.array([randint(0,1000) for n in range(0,100)])            ## generate 100 random integers between 1 and 1000 for y
    regress(y,x)                                                    ## run function using the 100 random integers for x & y

Thanks for any input.

I won't comment on style issues, but you could make better use of `numpy` when generating the arrays. `np.random.randint` take a size parameter as well as the range ones. Another minor point - you probably want a way of seeing the result. Add a `return` to the function, and print the result. — hpaulj, May 19 '19 at 23:44

Dan · Accepted Answer · 2019-05-19T23:11:23.897

This question belongs on code review and not on stack overflow.
Use comments to explain why not what. Your code should be clear enough that what it is doing does not require comments. But sometimes (not in this case), you'll need a comment to explain why you've done something that isn't obvious.
Loops and list comprehensions with numpy could be considered a code smell. First, look for built-in functions, then try to find a vectorized approach. Failing that, you might need to resort to a loop/list-comprehension but in general, this won't be the case. for example, in this case, numpy comes with np.random.randint.
Use variables rather than passing constant values into function, especially if you're using them twice! You want 1000 values in your x and y arrays, put that in a variable.

Your code will re-fit the regression every time you call regress which is computationally wasteful. Take a look at the way interp1d works in scipy. Its output is a function that you can reuse for interpolating. This would be a good pattern in your case as well and you can implement it using a concept from functional programming called a closure. This will be easier to explain in code:

def regress(x, y):
    """
    A docstring is more pythonic than unneeded inline comments: https://www.python.org/dev/peps/pep-0257/
    """
    slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)   

    def regression_function(xi):  # This bit is the closure. Notice how it has access to the variables that are in the parent functions scope. The closure will remember the state of those values even after they go out of scope with the parent function. 
        return xi*slope + intercept 

    return regression_function  # Return the actual function you created above itself so that you can reuse it later.

and to use it:

n = 1000
data_min = 0
data_max = 100
x = np.random.randint(data_min, data_max, (0,n))          
y = np.random.randint(data_min, data_max, (0,n))          
f_reg = regress(x,y)
xi = np.arange(1000)
yi = f_reg(xi)

Another option could be to use scikit-learn. Instead of a closure, scikit-learn uses an object-oriented approach to remembering state. In this case, you call the fit method once upfront to learn the state and then the predict method later to re-use that learned state.
Finally, and this is really important, there is nothing pythonic about using 2.7 in 2019. Switch over to python 3. Support is being dropped for 2 next year. Some major libraries, like pandas, have already dropped support for 2. Don't learn to use a language that is already obsolete!

Don't send things off to CR unless you are active there and prepared to answer this question. — hpaulj, May 19 '19 at 23:29
@hpaulj ...unless I am prepared to answer this question?? I think I already have? But also, are you saying that this question is in scope for stack overflow? — Dan, May 19 '19 at 23:31
Indeed you did answer, very much in CR style. Part of why I discourage CR recommendations is that the `numpy/simpy` traffic is too low there. But this poster did ask for `beautiful` and `pythonic`. https://codereview.stackexchange.com/questions/219781/improve-performance-of-comparing-two-numpy-arrays is an example which I think should not have been migrated. — hpaulj, May 20 '19 at 02:10

Pythonic linear regression

1 Answers1