
I'm trying to understand why the accuracy of my algorithm has suddenly changed quite dramatically. One small change I made was adding a fourth `:` when I discovered that I was using only 3 indices when standardizing my 4-dimensional train/test set. Now I'm curious - would the old and new code below do the same thing? If not, how does indexing into a 4-dimensional array with only 3 indices work?

Old:

   # standardize all non-binary variables
   channels = 14 # int(X.shape[1])
   mu_f     = np.zeros(shape=channels)
   sigma_f  = np.zeros(shape=channels)

   for i in range(channels):
      mu_f[i]    = np.mean(X_train[:,i,:])
      sigma_f[i] = np.std(X_train[:,i,:])   

   for i in range(channels):
      X_train[:, i, :]  -= mu_f[i]   
      X_test[:, i, :]   -= mu_f[i]

      if (sigma_f[i] != 0):
         X_train[:, i, :]  /= sigma_f[i]
         X_test[:, i, :]   /= sigma_f[i]

New:

   # standardize all non-binary variables
   channels = 14
   mu_f     = np.zeros(shape=channels)
   sigma_f  = np.zeros(shape=channels)

   for i in range(channels):
      mu_f[i]    = np.mean(X_train[:,i,:,:])
      sigma_f[i] = np.std(X_train[:,i,:,:])   

   for i in range(channels):
      X_train[:, i, :, :]  -= mu_f[i]   
      X_test[:, i, :, :]   -= mu_f[i]

      if (sigma_f[i] != 0):
         X_train[:, i, :, :]  /= sigma_f[i]
         X_test[:, i, :, :]   /= sigma_f[i]
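To see why the two versions might agree: NumPy pads a partial index with trailing full slices, so on a 4-D array `X[:, i, :]` selects the same block as `X[:, i, :, :]`. A minimal sketch (the shape here is arbitrary, not your real data):

```python
import numpy as np

# A small 4-D array standing in for X_train; the shape is illustrative.
X = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)

# A missing trailing index is treated as a full slice, so these two
# expressions are views of the same block of the array:
assert np.array_equal(X[:, 1, :], X[:, 1, :, :])

# The same holds for the reductions used above:
assert np.mean(X[:, 1, :]) == np.mean(X[:, 1, :, :])
assert np.std(X[:, 1, :]) == np.std(X[:, 1, :, :])
```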
pir
  • Do you want to know why it is slower or how you can speed up things? ... because almost everything in your code can be vectorized, which should boost performance ... – plonser Apr 16 '15 at 16:01
  • That would be great! It would make more sense if you added an answer here http://stackoverflow.com/questions/29418031/standardization-preprocessing-for-4-dimensional-array instead. – pir Apr 16 '15 at 16:06
  • To answer your question, yes, they both do the exact same thing. You could actually leave out all the trailing `:` and would still get the same result. – Jaime Apr 16 '15 at 17:12

1 Answer


I don't see why the extra `:` makes a difference. It doesn't when I do time tests on a simple `np.mean(X[:,1])` vs. `np.mean(X[:,1,:,:])`, etc.

As for plonser's suggestion that you can vectorize the whole thing, the key is realizing that `np.mean` and `np.std` take some added parameters, in particular `axis` and `keepdims`. Check their docs and play around with sample arrays.

Xmean = np.mean(X, axis=(0, 2, 3), keepdims=True)
Xstd  = np.std(X, axis=(0, 2, 3), keepdims=True)
X -= Xmean
X /= Xstd
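Spelled out as a drop-in replacement for the two loops in the question, this is a sketch under the assumption that the channel axis is axis 1 (the shapes below are made up for illustration). `keepdims=True` keeps the statistics shaped `(1, channels, 1, 1)` so broadcasting lines up with axis 1, and replacing zero sigmas with 1 mirrors the `if (sigma_f[i] != 0)` guard:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 14, 6, 6))   # (samples, channels, h, w) - illustrative
X_test  = rng.normal(size=(4, 14, 6, 6))

# Per-channel statistics over all non-channel axes, computed from the
# training set only; keepdims=True gives shape (1, 14, 1, 1).
mu    = X_train.mean(axis=(0, 2, 3), keepdims=True)
sigma = X_train.std(axis=(0, 2, 3), keepdims=True)
sigma[sigma == 0] = 1.0   # constant channels: subtract the mean, skip the division

X_train = (X_train - mu) / sigma
X_test  = (X_test - mu) / sigma
```

This replaces both Python-level loops with a handful of whole-array operations, which is where the speedup comes from.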
hpaulj
  • and he names his variable `X_train`. :) It's just an illustration. – hpaulj Apr 16 '15 at 16:28
  • Thanks. Are you also certain that my old code vs. my new code returns the same result for different kinds of values in `X_train`? – pir Apr 16 '15 at 16:46
  • No. It's possible I've missed some nuances in what you are trying do. You need to set up parallel functions and test them - both for values and speed. – hpaulj Apr 16 '15 at 18:41