Python, Pandas: 80/20 Randomly Split Data; How to loop when index value is 'missing'?

Question

I am trying to loop through a Series that was randomly sampled from an existing data set (to serve as a training data set). Here is the output of my Series after the split:

Index     data
0         1150
1         2000
2         1800
.         .
.         .
.         .
1960      1800
1962      1200
.         .
.         .
.         .
20010     1500

There is no index of 1961 because the random selection process that created the training data set removed it. When I try to loop through the Series to calculate my residual sum of squares, it fails with a KeyError. Here is my loop code:

def ResidSumSquares(x, y, intercept, slope):    
    out = 0
    temprss = 0
    for i in x:
        out = (slope * x.loc[i]) + intercept
        temprss = temprss + (y.loc[i] - out)
    RSS = temprss**2
    return print("RSS: {}".format(RSS))

KeyError: 'the label [1961] is not in the [index]'

I am still very new to Python and I am not sure of the best way to fix this.

Thank you in advance.

score 0 · Answer 1 · edited May 23 '17 at 12:30

I found the answer right after I posted the question; my apologies. It was posted by @mkln in:

How to reset index in a pandas data frame?

df = df.reset_index(drop=True)

This resets the index of the entire Series; reset_index is not exclusive to the DataFrame type.
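As a small illustration, assuming a hypothetical Series named train whose labels skip 1961 (the numbers are taken from the output in the question, the variable names are made up), reset_index(drop=True) renumbers the rows from 0 and discards the old labels:

```python
import pandas as pd

# Hypothetical training Series whose labels skip 1961, as after a random split.
train = pd.Series([1800, 1200, 1500], index=[1960, 1962, 20010])

# drop=True discards the old labels instead of keeping them as a column.
renumbered = train.reset_index(drop=True)

print(1961 in train.index)      # the label is genuinely absent before the reset
print(list(renumbered.index))   # labels after the reset
print(list(renumbered.values))  # the data itself is unchanged
```

Without drop=True, calling reset_index on a Series returns a DataFrame that keeps the old labels as an extra column, which is usually not what a loop like this wants.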

My updated function code works like a charm:

def ResidSumSquares(x, y, intercept, slope):
    out = 0
    myerror = 0
    # Renumber both Series from 0 so every label used in the loop exists.
    x = x.reset_index(drop=True)
    y = y.reset_index(drop=True)
    for i in x.index:  # iterate over labels; "for i in x" would yield values
        out = slope * x.loc[i] + float(intercept)
        myerror = myerror + (y.loc[i] - out)
    RSS = myerror**2
    return print("RSS: {}".format(RSS))

score 0 · Answer 2 · answered Jan 11 '18 at 19:38

You omit your actual call to ResidSumSquares. How about not resetting the index within the function and passing the training set as x? Iterating over an unusual (not 1,2,3,...) index shouldn't be a problem.
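A minimal sketch of that idea, assuming a made-up DataFrame df with columns x and y and arbitrary coefficients (none of these names or numbers come from the question): the loop walks over x.index, so it only ever asks for labels that survived the split. Note that this version squares each residual before summing, which is the usual definition of a residual sum of squares.

```python
import pandas as pd

# Hypothetical full data set; names and numbers are for illustration only.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

# 80/20 random split: sample 80% of the rows, the remaining rows are the test set.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

def resid_sum_squares(x, y, intercept, slope):
    """Sum of squared residuals, looping over the labels that actually exist."""
    rss = 0.0
    for label in x.index:  # labels, not values
        predicted = slope * x.loc[label] + intercept
        rss += (y.loc[label] - predicted) ** 2
    return rss

rss = resid_sum_squares(train["x"], train["y"], intercept=0.0, slope=2.0)
print("RSS: {}".format(rss))
```

Because x and y are taken from the same train frame, they share the same labels, so no index reset is needed.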

3pitt

Patrick O'Connor · Answer 3 · 2018-01-11T20:12:13.850

A few observations:

1. As currently written, your function calculates the squared sum of the errors, not the sum of squared errors. Is this intentional? The latter is what regression-type applications typically use. Since your variable is named RSS (residual sum of squares, I assume), you will want to revisit this.
2. If x and y are consistent subsets of the same larger dataset, then you should have the same indices for both, right? Otherwise, by dropping the index you may be matching unrelated x and y values and glossing over a bug earlier in the code.
3. Since you are using Pandas, this can easily be vectorized to improve readability and speed (Python loops have high overhead).

Example of (3), assuming (2), and illustrating the differences between approaches in (1):

# assuming your indices should be aligned,
# pandas will link xs and ys by index
vectorized_error = y - (slope*x + float(intercept))
# your residual sum of squares--you have to square first!
rss = (vectorized_error**2).sum()
# if you really want the square of the summed errors...
sse = (vectorized_error.sum())**2
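To make the difference concrete, here is a self-contained sketch with made-up numbers (the Series contents and coefficients are assumptions, not data from the question). The labels have a gap and y is stored in a different order than x, to show that pandas matches rows by label rather than by position:

```python
import pandas as pd

# Hypothetical training data with a gap in the labels (no 1961).
x = pd.Series([1.0, 2.0, 3.0], index=[1960, 1962, 20010])
# Same labels in a different order: pandas aligns on labels, not positions.
y = pd.Series([7.0, 4.0, 4.0], index=[20010, 1960, 1962])

intercept, slope = 1.0, 2.0  # made-up coefficients

# Residual = observed minus predicted, computed for every label at once.
residuals = y - (slope * x + intercept)

rss = (residuals ** 2).sum()        # sum of squared residuals
squared_sum = residuals.sum() ** 2  # square of the summed residuals

print(rss, squared_sum)
```

Here the residuals are 1, -1 and 0, so rss is 2.0 while squared_sum is 0.0: positive and negative errors cancel when you sum before squaring, which is why the squared sum can look perfect on a poor fit.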

Edit: didn't notice this has been dead for a year.
