21

I have a numpy ndarray with shape (30, 480, 640); the 1st and 2nd axes represent locations (latitude and longitude), and the 0th axis contains the actual data points. I want to take the most frequent value along the 0th axis at each location, i.e. construct a new array with shape (1, 480, 640):

>>> data
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]],

       [[40, 40, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59]]])

(perform calculation)

>>> new_data 
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])

The data points will contain negative and positive floating-point numbers. How can I perform such a calculation? Thanks a lot!

I tried numpy.unique, but I got "TypeError: unique() got an unexpected keyword argument 'return_inverse'". I'm using numpy version 1.2.1 installed on Unix, and it doesn't support return_inverse. I also tried mode, but it takes forever to process such a large amount of data... so is there an alternative way to get the most frequent values? Thanks again.

oops

5 Answers

27

To find the most frequent value of a flat array, use unique, bincount and argmax:

arr = np.array([5, 4, -2, 1, -2, 0, 4, 4, -6, -1])
u, indices = np.unique(arr, return_inverse=True)
u[np.argmax(np.bincount(indices))]
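# -> 4 (it occurs three times, more often than any other value)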

To work with a multidimensional array, unique can still be applied to the flattened data as-is, but bincount has to be run on each slice, which np.apply_along_axis takes care of:

arr = np.array([[5, 4, -2, 1, -2, 0, 4, 4, -6, -1],
                [0, 1,  2, 2,  3, 4, 5, 6,  7,  8]])
axis = 1
u, indices = np.unique(arr, return_inverse=True)
u[np.argmax(np.apply_along_axis(np.bincount, axis, indices.reshape(arr.shape),
                                None, np.max(indices) + 1), axis=axis)]
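# -> array([4, 2]): the most frequent value in each row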

With your data:

data = np.array([
   [[ 0,  1,  2,  3,  4],
    [ 5,  6,  7,  8,  9],
    [10, 11, 12, 13, 14],
    [15, 16, 17, 18, 19]],

   [[ 0,  1,  2,  3,  4],
    [ 5,  6,  7,  8,  9],
    [10, 11, 12, 13, 14],
    [15, 16, 17, 18, 19]],

   [[40, 40, 42, 43, 44],
    [45, 46, 47, 48, 49],
    [50, 51, 52, 53, 54],
    [55, 56, 57, 58, 59]]])
axis = 0
u, indices = np.unique(data, return_inverse=True)
u[np.argmax(np.apply_along_axis(np.bincount, axis, indices.reshape(data.shape),
                                None, np.max(indices) + 1), axis=axis)]
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

NumPy 1.2, really? You can approximate np.unique(return_inverse=True) reasonably efficiently using np.searchsorted (it's an additional O(n log n), so shouldn't change the performance significantly):

u = np.unique(arr)
indices = np.searchsorted(u, arr.flat)
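
Plugging that into the apply_along_axis recipe above gives a version of the full calculation that avoids return_inverse entirely. This is only a sketch, untested on such an old release (bincount's minlength argument may also postdate NumPy 1.2):

u = np.unique(data)
indices = np.searchsorted(u, data.flat)   # plays the role of return_inverse
axis = 0
counts = np.apply_along_axis(np.bincount, axis, indices.reshape(data.shape),
                             None, len(u))
most_frequent = u[np.argmax(counts, axis=axis)]   # shape (480, 640); add [np.newaxis] for (1, 480, 640)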
ecatmur
  • @ecatmur, I'm using numpy version 1.2.1 and it doesn't support np.unique(return_inverse)... any suggestions? – oops Sep 10 '12 at 09:52
  • @oops see above, you'll have to test it yourself as I have no idea where I'd even find such an old version of numpy ;) – ecatmur Sep 10 '12 at 11:56
  • For big ndarrays with more dimensions, this approach is not really suitable, since it will allocate an array of size (N_dim1,N_dim2,...,N_unique) which gets out of hand very quickly. – CheshireCat Apr 29 '20 at 11:39
9

Use SciPy's mode function:

import numpy as np
from scipy.stats import mode

data = np.array([[[ 0,  1,  2,  3,  4],
                  [ 5,  6,  7,  8,  9],
                  [10, 11, 12, 13, 14],
                  [15, 16, 17, 18, 19]],

                 [[ 0,  1,  2,  3,  4],
                  [ 5,  6,  7,  8,  9],
                  [10, 11, 12, 13, 14],
                  [15, 16, 17, 18, 19]],

                 [[40, 40, 42, 43, 44],
                  [45, 46, 47, 48, 49],
                  [50, 51, 52, 53, 54],
                  [55, 56, 57, 58, 59]]])

print(data)

# find mode along the zero-th axis; the return value is a tuple of the
# modes and their counts.
print(mode(data, axis=0))
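
The modes themselves come back as the first element of that tuple; at least with the SciPy of that era the reduced axis is kept at length 1, which matches the (1, 480, 640) shape the question asks for. For example:

modes, counts = mode(data, axis=0)
print(modes)   # shape (1, 4, 5) here, (1, 480, 640) for the real data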
Taro Sato
  • Thank you Taro Sato, but it takes a very long time to process large arrays... any suggestion to speed it up? – oops Sep 10 '12 at 07:03
  • Okay, I noticed that you want to do this with floats. To do that, I think you need a slightly different approach, since it doesn't really make sense to ask for the most frequent float; there's only a small chance that two floats coincide across repeated experiments. Do you really need to find such a weird thing? If you know (roughly) the distribution of your sample, then there are better measures to compute, such as the mean and median, to find out what the most likely number in your sample is. – Taro Sato Sep 10 '12 at 17:54
  • Do people still widely use the scipy package? I read somewhere that mean from scipy is deprecated. Just curious to know :) – Mona Jalal Jul 30 '16 at 23:07
  • It's been a while since this answer was written, so I would not be surprised if the function has been deprecated (in favor of `np.mean`, for example)... but I think I was commenting on the general approach, not a specific function of a package. – Taro Sato Jul 30 '16 at 23:32
1

A slightly better solution, in my opinion, is the following:

tmpL = np.array([3, 2, 3, 2, 5, 2, 2, 3, 3, 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2])
unique, counts = np.unique(tmpL, return_counts=True)
unique[np.argmax(counts)]   # -> 2, the most frequent value

Using np.unique with return_counts=True we get the count of each unique element. The index of the maximum in counts then picks out the corresponding element of unique.
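
To apply this per location along the 0th axis of the question's (30, 480, 640) array, one option is np.apply_along_axis with a small helper (here called most_frequent). This is just a sketch and, like scipy's mode, it loops in Python, so it will not be fast on large arrays:

import numpy as np

def most_frequent(a):
    unique, counts = np.unique(a, return_counts=True)
    return unique[np.argmax(counts)]

new_data = np.apply_along_axis(most_frequent, 0, data)[np.newaxis]   # shape (1, 480, 640)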

MilesConn
0

Flatten your array, then build a collections.Counter from it. As usual, take special care when comparing floating-point numbers.
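
A minimal sketch of that idea for the flattened case (most_common(1) returns the (value, count) pair with the highest count):

import collections
import numpy as np

arr = np.array([5, 4, -2, 1, -2, 0, 4, 4, -6, -1])
value, count = collections.Counter(arr.flatten()).most_common(1)[0]
# value -> 4, count -> 3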

ev-br
0

Explaining @ecatmur's part

u[np.argmax(np.apply_along_axis(np.bincount, axis, indices.reshape(arr.shape),
                                None, np.max(indices) + 1), axis=axis)]

a little bit more, and restructuring it to be easier to re-read (because I used this solution and after a few weeks I was wondering what was actually happening in it):

axis = 0
uniques, indices = np.unique(arr, return_inverse=True)   # arr is the input array, e.g. the (30, 480, 640) data

args_for_bincount_fn = None, np.max(indices) + 1   # weights, minlength
binned_indices = np.apply_along_axis(np.bincount,
                            axis,
                            indices.reshape(arr.shape),
                            *args_for_bincount_fn)

most_common = uniques[np.argmax(binned_indices, axis=axis)]
Createdd