
I have a numpy array that has many samples in it of varying length

Samples = np.array([[1001, 1002, 1003],
                    ... ,
                    [1001, 1002]])

I want to (elementwise) subtract the mean of the array then divide by the standard deviation of the array. Something like:

newSamples = (Samples-np.mean(Samples))/np.std(Samples)

Except that doesn't work for irregular shaped arrays,

np.mean(Samples) causes

unsupported operand type(s) for /: 'list' and 'int'

presumably because numpy expects a fixed size along each axis, and when it encounters a sample of a different length it falls back to storing the rows as list objects, which it can't do arithmetic on. What is a good approach to solving this using numpy?

example input:

Sample = np.array([[1, 2, 3],
                   [1, 2]])

After subtracting by the mean and then dividing by standard deviation:

Sample = array([[-1.06904497,  0.26726124,  1.60356745], 
                [-1.06904497,  0.26726124]])
Michael Hackman

  • Your "array" is not one of numbers but of list objects. See https://stackoverflow.com/questions/44293329/numpy-array-division-unsupported-operand-types-for-list-and-float – Ignacio Vergara Kausel Jun 01 '17 at 07:43
  • Possible duplicate of [Numpy Array Division - unsupported operand type(s) for /: 'list' and 'float'](https://stackoverflow.com/questions/44293329/numpy-array-division-unsupported-operand-types-for-list-and-float) – Ignacio Vergara Kausel Jun 01 '17 at 07:45
  • Do you want to use the mean/standard deviation of the whole array (Samples) or just of each element of Samples? – Nuageux Jun 01 '17 at 07:47
  • What's your expected output of (Samples-np.mean(Samples))/np.std(Samples)? – Allen Qin Jun 01 '17 at 07:48
  • Will add an edit to the question to demonstrate example input and output – Michael Hackman Jun 01 '17 at 07:49
  • Don't use ragged arrays. Make a list of arrays if you need to and loop – Daniel F Jun 01 '17 at 07:49
  • @IgnacioVergaraKausel I can't add padding of np.nan to this array and still get an accurate mean and standard deviation. – Michael Hackman Jun 01 '17 at 07:55
  • @MichaelHackman: Are you sure about your output ? I obtain: `[array([-1.06904497, 0.26726124, 1.60356745]), array([-1.06904497, 0.26726124])]` – Nuageux Jun 01 '17 at 07:56
  • @Nuageux I am subtracting the overall mean from each element in the array, then dividing by the overall standard deviation – Michael Hackman Jun 01 '17 at 07:56
  • @MichaelHackman yes you can get an accurate [mean](https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanmean.html) and [std](https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanstd.html#numpy.nanstd) with nan arrays. – Imanol Luengo Jun 01 '17 at 07:57
  • @Nuageux Yes, you were right, I forgot to change the code I was working with. Edited to reflect the changes. – Michael Hackman Jun 01 '17 at 07:59
  • @ImanolLuengo That is a very handy tip, didn't know about nanmean – Michael Hackman Jun 01 '17 at 08:02
  • If you are looking for a vectorized method, get a regular array filled with NaNs for those empty places from this [**`post`**](https://stackoverflow.com/a/40571482/3293881) and then simply use nan funcs : `(a-np.nanmean(a))/np.nanstd(a)`, where `a` is the NaN filled array. It has some setup overhead, so performance gain if any would depend on the data format and size. – Divakar Jun 01 '17 at 08:07
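That NaN-filling suggestion can be sketched roughly as follows (a minimal sketch of the idea, not the linked post's actual code; here `Samples` is taken to be a plain list of lists):

```python
import numpy as np

# Sketch of the vectorized NaN-fill idea from the comment above (assumed
# details). Samples is a plain list of lists, not a ragged np.array.
Samples = [[1, 2, 3], [1, 2]]

lens = np.array([len(x) for x in Samples])    # row lengths: [3, 2]
mask = lens[:, None] > np.arange(lens.max())  # True where a real value goes
a = np.full(mask.shape, np.nan)               # NaN-padded rectangular array
a[mask] = np.concatenate(Samples)             # fill the valid slots row by row

new_a = (a - np.nanmean(a)) / np.nanstd(a)    # nan funcs ignore the padding
```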

2 Answers


Don't make ragged arrays. Just don't. Numpy can't do much with them, and any code you write for them will be unreliable and slow, because numpy isn't designed to work that way. It stores them as object-dtype arrays:

Sample
array([[1, 2, 3], [1, 2]], dtype=object)

Almost no numpy functions work on these. In this case the objects are list objects, which makes your code even more confusing: you either have to switch between list and ndarray methods, or stick to list-safe numpy functions. This is a recipe for disaster, as anyone noodling around with the code later (even yourself, if you forget) will be dancing in a minefield.

There are two things you can do with your data to make things work better:

First method is to index and flatten.

i = np.cumsum(np.array([len(x) for x in Sample]))  # running end index of each sample
flat_sample = np.hstack(Sample)                    # all values in one 1D array

This preserves the index of the end of each sample in i, while keeping the samples together in a flat 1D array.

The other method is to pad one dimension with np.nan and use nan-safe functions

m = np.array([len(x) for x in Sample]).max()  # length of the longest sample
# list(x) + [...] is list concatenation, so this works whether the rows
# are lists or 1D arrays
nan_sample = np.array([list(x) + [np.nan] * (m - len(x)) for x in Sample])

So to do your calculations, you can use flat_sample and do similar to above:

new_flat_sample = (flat_sample - np.mean(flat_sample)) / np.std(flat_sample) 

and use i to recreate your original array (or a list of arrays, which I recommend; see np.split):

new_list_sample = np.split(new_flat_sample, i[:-1])

[array([-1.06904497,  0.26726124,  1.60356745]),
 array([-1.06904497,  0.26726124])]

Or use nan_sample, but you will need to replace np.mean and np.std with their nan-safe counterparts np.nanmean and np.nanstd:

new_nan_sample = (nan_sample - np.nanmean(nan_sample)) / np.nanstd(nan_sample)

array([[-1.06904497,  0.26726124,  1.60356745],
       [-1.06904497,  0.26726124,         nan]])
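If you then want the ragged structure back from the padded result, one way (my addition, not part of the answer above) is to mask out the NaN padding row by row:

```python
import numpy as np

# new_nan_sample as computed above (values copied from the answer's output)
new_nan_sample = np.array([[-1.06904497, 0.26726124, 1.60356745],
                           [-1.06904497, 0.26726124, np.nan]])

# Drop the NaN padding from each row to recover a list of 1D arrays,
# matching the np.split result of the first method.
new_list_sample = [row[~np.isnan(row)] for row in new_nan_sample]
```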
Daniel F

@MichaelHackman (following the comment remark). That's weird, because when I compute the overall mean and std and then apply them, I obtain a different result (see code below).

import numpy as np

Samples = np.array([[1, 2, 3],
                    [1, 2]])
c = np.hstack(Samples)  # gives [1, 2, 3, 1, 2]
mean, std = np.mean(c), np.std(c)
newSamples = np.asarray([(np.array(xi) - mean) / std for xi in Samples])
print(newSamples)
# [array([-1.06904497,  0.26726124,  1.60356745]), array([-1.06904497,  0.26726124])]

edit: Added np.asarray() and moved the mean/std computation outside the loop, following Imanol Luengo's excellent comments (Thanks!)

Nuageux