5

I have a set of data (X, Y). The values of my independent variable X are not unique, so some of them are repeated. I want to output a new array containing: X_unique, the unique values of X; Y_mean, the mean of all the Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all the Y values corresponding to each value in X_unique.

x = data[:,0]
y = data[:,1]
Divakar
obtmind
  • Can you add a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) to your question? – Mazdak Jan 05 '16 at 17:24
  • Have a look at http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy – das-g Jan 05 '16 at 17:32
  • Aside: if you're working with actual data, you're probably going to find it easier to use [`pandas`](http://pandas.pydata.org) than bare numpy. If your `data` was a `DataFrame` instead of an `ndarray`, something like `df.groupby(0)[1].agg(["mean", "std"])` would work. – DSM Jan 05 '16 at 18:23

3 Answers

4

You can use binned_statistic from scipy.stats, which can apply various statistic functions to chunks of a 1D array. To get the chunks, we need to sort the data and find the positions where one chunk ends and the next begins, for which np.unique is useful. Putting it all together, here's an implementation:

import numpy as np
from scipy.stats import binned_statistic as bstat

# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]

# Unique first-column values and the index where each group starts in the sorted data
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])

# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)

According to the docs of binned_statistic, one can also pass a custom statistic function:

function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
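
As a rough illustration of the custom-function option quoted above, here is a minimal sketch that reuses the same sort-and-break approach. The helper `value_range` is a hypothetical statistic made up for this example (it is not part of SciPy), and the data is the sample array from this answer.

```python
import numpy as np
from scipy.stats import binned_statistic

# Sample data from this answer: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Sort rows by x so that equal x values form contiguous chunks
sdata = data[data[:, 0].argsort()]
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
breaks = np.append(breaks, len(sdata))

def value_range(values):
    # Hypothetical custom statistic: largest minus smallest y in the chunk
    return np.max(values) - np.min(values)

# Bin the row positions; each bin covers exactly one chunk of equal x
idx_range = np.arange(len(sdata))
range_y, _, _ = binned_statistic(idx_range, sdata[:, 1],
                                 statistic=value_range, bins=breaks)
```

Any function that takes a 1D array and returns a single number should work in place of `value_range`.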

Sample input and output:

In [121]: data
Out[121]: 
array([[2, 5],
       [2, 2],
       [1, 5],
       [3, 8],
       [0, 8],
       [6, 7],
       [8, 1],
       [2, 5],
       [6, 8],
       [1, 8]])

In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]: 
array([[ 0.        ,  8.        ,  0.        ],
       [ 1.        ,  6.5       ,  1.5       ],
       [ 2.        ,  4.        ,  1.41421356],
       [ 3.        ,  8.        ,  0.        ],
       [ 6.        ,  7.5       ,  0.5       ],
       [ 8.        ,  1.        ,  0.        ]])
Divakar
  • Didn't know about the existence of `binned_statistic`. I will probably use it a lot in the near future! I was writing cython code to achieve similar things lol! thanks! – Imanol Luengo Jan 05 '16 at 22:30
  • @imaluengo I knew it could get average values, but I wasn't sure about standard deviation, and it worked! The source is this answer - http://stackoverflow.com/a/29894547/3293881. Seems really neat to have something that works natively with NumPy arrays! – Divakar Jan 05 '16 at 22:35
2
x_unique  = np.unique(x)
y_means = np.array([np.mean(y[x==u]) for u in x_unique])
y_stds = np.array([np.std(y[x==u]) for u in x_unique])
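
The list comprehensions above scan the whole array once per unique value. If that becomes slow for many distinct X values, one possible alternative (a sketch that is not part of this answer) is to group with `np.unique(..., return_inverse=True)` and `np.bincount`. The sample data is borrowed from the first answer, and the standard deviation uses the same `ddof=0` convention as `np.std`.

```python
import numpy as np

# Sample data borrowed from the first answer: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])
x = data[:, 0]
y = data[:, 1].astype(float)

# inverse[i] is the position of x[i] within x_unique, i.e. its group label
x_unique, inverse = np.unique(x, return_inverse=True)

# Per-group count and sum give the per-group mean
counts = np.bincount(inverse)
y_means = np.bincount(inverse, weights=y) / counts

# Population standard deviation (ddof=0, the np.std default):
# square root of the per-group mean squared deviation
sq_dev = (y - y_means[inverse]) ** 2
y_stds = np.sqrt(np.bincount(inverse, weights=sq_dev) / counts)
```

For small inputs the list-comprehension version is simpler and just as good; this variant only matters when the number of unique values is large.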
Peter
1

Pandas is designed for this kind of task:

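import numpy as np
import pandas

# Random sample data: column 0 is x, column 1 is y
data = np.random.randint(1, 5, 20).reshape(10, 2)
# Group rows by column 0 and average column 1 within each group
pandas.DataFrame(data).groupby(0).mean()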

gives

          1
0          
1  2.666667
2  3.000000
3  2.000000
4  1.500000
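
The snippet above only computes the mean, while the question also asks for the standard deviation. A minimal sketch of getting both follows, using the sample data from the first answer; the column labels `"x"`, `"y"`, `"y_mean"` and `"y_std"` are names chosen here for illustration. Note that pandas' `std` defaults to `ddof=1` (sample standard deviation), whereas `np.std` defaults to `ddof=0`, so `ddof=0` is passed explicitly to match the NumPy-based answers.

```python
import numpy as np
import pandas as pd

# Sample data borrowed from the first answer: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])
df = pd.DataFrame(data, columns=["x", "y"])

# Group the y values by x, then compute both statistics per group
grouped = df.groupby("x")["y"]
summary = pd.DataFrame({
    "y_mean": grouped.mean(),
    "y_std": grouped.std(ddof=0),  # ddof=0 matches np.std's default
})
```

The unique X values end up in `summary.index`. With the default `ddof=1`, groups containing a single row would get a standard deviation of NaN instead of 0.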
B. M.