
So I wrote a function to standardize my data, but I'm having trouble making it work. I want to iterate through an array of my data and standardize it.

Here's my function. I've tried transposing my arr, but it still doesn't work:

def Scaling(arr,data):    
    scaled=[[]]   
    for a in arr.T:
        scaled = ((a-data.mean())/(data.std()))
        scaled = np.asarray(scaled)
    return scaled

When I run my code I only get a 1D array as the output instead of 10D.

  • A bit out of topic, but you could use zscore from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html – Adonis Jan 29 '19 at 14:15
  • 10D or a shape with 10 columns? And could you please post a sample input and output? – JE_Muc Jan 29 '19 at 14:51
  • Please add more information. What is data, how is this function called, and what is arr? A quick Google search reveals multiple other SO answers, like https://stackoverflow.com/questions/4544292/how-do-i-standardize-a-matrix/40951248 – ege Jan 29 '19 at 14:59
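As the first comment suggests, scipy.stats.zscore offers a ready-made version of this. A minimal sketch (note that zscore standardizes an array by its own mean and standard deviation, so there is no separate data argument as in the question's function):

import numpy as np
from scipy import stats

data = np.arange(10)

# standardize data by its own mean and standard deviation (ddof=0 by default),
# equivalent to (data - data.mean()) / data.std()
print(stats.zscore(data))
# approximately [-1.567 -1.219 -0.870 -0.522 -0.174  0.174  0.522  0.870  1.219  1.567]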

1 Answer


Because data.mean() and data.std() are aggregated constants (scalars), consider running the needed arithmetic operation directly on the entire array without any for loops. Each constant is then applied to every column of the array in a vectorized operation:

def Scaling(arr,data):    
    return (arr.T-data.mean())/(data.std())

Your current for loop only returns the last array assignment of the loop. You initialize an empty nested list but never append to it; instead, you re-assign scaled to a new array on each iteration. Ideally you would append the arrays to a collection and concatenate them together outside the loop, as in the sketch below. Nonetheless, this type of operation is not needed here; simple matrix algebra suffices.
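For reference, if the loop were kept, the fix would be to append each scaled slice to a list and stack them once after the loop; a minimal sketch of that repair (scaling_loop is a hypothetical name):

import numpy as np

def scaling_loop(arr, data):
    # collect each scaled column of arr (i.e. each row of arr.T) in a list
    scaled = []
    for a in arr.T:
        scaled.append((a - data.mean()) / data.std())
    # stack once outside the loop into a single 2-D array
    return np.stack(scaled)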


To demonstrate, see below with an exaggerated sequential input array (which can be swapped for the OP's actual data) to show the end calculations:

import numpy as np

data = np.arange(10)
arr = np.concatenate([np.ones((5, 1)),
                      np.ones((5, 1))+1,
                      np.ones((5, 1))+2,
                      np.ones((5, 1))+3,
                      np.ones((5, 1))+4], axis=1)

def Scaling(arr,data):    
    return (arr.T-data.mean())/(data.std())

new_arr = Scaling(arr, data)

print(arr)
# [[1. 2. 3. 4. 5.]
#  [1. 2. 3. 4. 5.]
#  [1. 2. 3. 4. 5.]
#  [1. 2. 3. 4. 5.]
#  [1. 2. 3. 4. 5.]]

print(new_arr)
# [[-1.21854359 -1.21854359 -1.21854359 -1.21854359 -1.21854359]
#  [-0.87038828 -0.87038828 -0.87038828 -0.87038828 -0.87038828]
#  [-0.52223297 -0.52223297 -0.52223297 -0.52223297 -0.52223297]
#  [-0.17407766 -0.17407766 -0.17407766 -0.17407766 -0.17407766]
#  [ 0.17407766  0.17407766  0.17407766  0.17407766  0.17407766]]


  • Thank you, it kind of works; I'm getting the correct dimensions now. However, when I try to scale my test set with the means and std from the training set, my results don't make sense. – user3111739 Jan 29 '19 at 18:45
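Regarding that follow-up comment: the usual pattern is to compute the mean and standard deviation on the training set once and reuse those same scalars on the test set, so both share one scale. A minimal sketch, assuming train and test are NumPy arrays (both names are illustrative):

import numpy as np

def scale_with_train_stats(train, test):
    # statistics come from the training set only
    mu = train.mean()
    sigma = train.std()
    # apply the same constants to both sets
    return (train - mu) / sigma, (test - mu) / sigma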