
I'm trying to find the most frequent elements in a two-dimensional NumPy array, either row-wise or column-wise. I searched the docs and the web but couldn't find exactly what I'm looking for. Let me explain with an example; assume I have an arr as follows:

import numpy as np
arr = np.random.randint(0, 2, size=(5, 2))
arr

# Output
array([[1, 1],
       [0, 0],
       [0, 1],
       [1, 1],
       [1, 0]])

The expected output is an array that contains the most frequent element in each column or row, depending on the given axis input. I know that np.unique() returns the count of each unique value in the input array for a given axis, so it counts unique rows or columns in a 2-D array:

np.unique(arr, return_counts=True, axis=0)

# Output
(array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]]), array([1, 1, 1, 2]))

So, it tells me that the unique rows [0, 0], [0, 1] and [1, 0] occur once whereas [1, 1] occurs twice in arr. This does not work for me, because I need the most frequent element within each row (or column). So my expected output is as follows:

array([[1, 1],    # --> 1
       [0, 0],    # --> 0
       [0, 1],    # --> 0 or 1 since they have same frequency
       [1, 1],    # --> 1
       [1, 0]])   # --> 0 or 1 since they have same frequency

Consequently, the result can be array([1, 0, 0, 1, 0]) or array([1, 0, 1, 1, 1]) with shape (5, ).

PS:

I know that a solution can be found by iterating over columns or rows and finding the most frequent element of each using np.unique(); however, I want the most efficient way of doing this. NumPy is generally used for vectorized calculations on huge arrays, and in my case arr has a very large number of elements, so the computation would be costly with a for loop.

EDIT:

To be more clear, I added a loop-based solution. Since arr can contain not only 0s and 1s but arbitrary values, I decided to use a different randomized arr:

arr = np.random.randint(1, 4, size=(10, 3)) * 10

# arr:
array([[30, 30, 30],
       [10, 20, 30],
       [30, 30, 30],
       [30, 10, 20],
       [20, 20, 10],
       [20, 30, 20],
       [20, 30, 10],
       [10, 30, 10],
       [20, 10, 10],
       [20, 30, 30]])

most_freq_elem_in_rows = []
for row in arr:
  elements, counts = np.unique(row, return_counts=True)
  most_freq_elem_in_rows.append(elements[np.argmax(counts)])

# most_freq_elem_in_rows:
# [30, 10, 30, 10, 20, 20, 10, 10, 10, 30]

most_freq_elem_in_cols = []
for col in arr.T:
  elements, counts = np.unique(col, return_counts=True)
  most_freq_elem_in_cols.append(elements[np.argmax(counts)])

# most_freq_elem_in_cols:
# [20, 30, 10]

Then, most_freq_elem_in_rows and most_freq_elem_in_cols can be converted to NumPy arrays simply using np.array().
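For comparison, a loop-free sketch of the row-wise case is possible with a bincount offset trick, assuming the values are non-negative integers (rowwise_mode is a hypothetical helper name, not part of NumPy):

```python
import numpy as np

def rowwise_mode(a):
    """Row-wise mode of a 2-D array of non-negative integers.

    Shifts each row into its own range of bins so a single
    np.bincount call counts all rows at once; ties resolve to the
    smallest value, matching np.unique + argmax in the loop above.
    """
    n_rows = a.shape[0]
    m = a.max() + 1                            # bins per row
    offsets = np.arange(n_rows)[:, None] * m   # each row gets its own bin range
    counts = np.bincount((a + offsets).ravel(), minlength=n_rows * m)
    return counts.reshape(n_rows, m).argmax(axis=1)

arr = np.array([[30, 30, 30],
                [10, 20, 30],
                [30, 30, 30],
                [30, 10, 20],
                [20, 20, 10],
                [20, 30, 20],
                [20, 30, 10],
                [10, 30, 10],
                [20, 10, 10],
                [20, 30, 30]])

rowwise_mode(arr)    # row-wise: [30, 10, 30, 10, 20, 20, 10, 10, 10, 30]
rowwise_mode(arr.T)  # column-wise: [20, 30, 10]
```

Note that this allocates n_rows * (max + 1) bins, so it is only practical when the maximum value is small relative to the array size.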

cottontail
Ersel Er

1 Answer


If you can add a SciPy dependency, then scipy.stats.mode achieves that:

import numpy as np
from scipy.stats import mode

arr = np.random.randint(0, 2, size=(5, 2))

mode(arr, 0)
# ModeResult(mode=array([[0, 0]]), count=array([[3, 3]]))

mode(arr, 1)
# ModeResult(mode=array([[0],
#                        [1],
#                        [0],
#                        [0],
#                        [0]]),
#            count=array([[1],
#                         [2],
#                         [2],
#                         [2],
#                         [1]]))
FBruzzesi