NumPy does not provide a dedicated method for calculating the mode of some data. One reason for this could be that the mode is often used for non-numeric, categorical variables, while NumPy is focused on numeric calculations.
Here is an alternative using pandas.DataFrame.mode(). It also supports mixed-type data; see the example further below.
import pandas as pd
data = [[1, 3, 4, 2, 2, 7],
        [5, 2, 2, 1, 4, 1],
        [3, 3, 2, 2, 1, 1]]
df = pd.DataFrame(data)
df.mode()
# 0 1 2 3 4 5
# 0 1 3.0 2.0 2.0 1 1.0
# 1 3 NaN NaN NaN 2 NaN
# 2 5 NaN NaN NaN 4 NaN
Here, we are interested only in the first row. To fetch it, use one of the following:
modes = df.mode().values[0] # array([1., 3., 2., 2., 1., 1.])
modes = df.mode().iloc[0] # pd.Series(...)
Details:
- By default, pandas computes the column-wise modes. One can compute the row-wise modes by passing the argument axis=1: df.mode(axis=1)
- Starting with SciPy 1.9, support for non-numeric data in scipy.stats.mode() has been deprecated, and it will no longer be available in SciPy >=1.11. See the docs of scipy.stats.mode(). SciPy recommends using the Pandas approach.
- Pandas sorts the modes if there are multiple ones. If we just use the first row of the resulting DataFrame, we deviate slightly from the OP's question, which asked to pick one at random. Of course, we can fix this, see below.
- The function mode() yields all possible modes if there are more than one, and stores them in a DataFrame. Unfortunately, this results in NaN values for the columns with fewer modes than the column with the maximal number of modes. In order to accommodate the NaNs, Pandas converts the dtype of the columns from int to float, which I consider a bit ugly. To recover from this, we need to force the original dtype. The code below shows how to do this.
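The column-wise vs. row-wise behaviour and the int → float cast described above can be checked with a small sketch on the same data. The names col_modes and row_modes are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 3, 4, 2, 2, 7],
                   [5, 2, 2, 1, 4, 1],
                   [3, 3, 2, 2, 1, 1]])

# Column-wise modes (default, axis=0): one result column per input column
col_modes = df.mode()

# Row-wise modes: one result row per input row, padded with NaN on the right
row_modes = df.mode(axis=1)

# Columns that had to be padded with NaN are upcast to float;
# columns with the maximal number of modes keep their integer dtype
print(col_modes.dtypes)
print(row_modes)
```

Printing the dtypes makes it easy to see which columns will need the cast back to int.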
Fix 1: Recover from typecast int → float:
# Works for both np.ndarray and pd.Series
modes.astype(int)
# For a mixed-type DataFrame, one could do the following:
# (Works only for column-wise modes)
[dtype.type(m) for m, dtype in zip(modes, df.dtypes)]
Fix 2: Pick a mode at random if there are multiple
import numpy as np
modes = df.mode().apply(lambda x: np.random.choice(x.dropna()))
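If the random pick needs to be reproducible, one option (a variation, not part of the approach above) is to draw from a seeded NumPy Generator instead of the global np.random state. The seed 0 and the names rng and picked are arbitrary choices for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 3, 4, 2, 2, 7],
                   [5, 2, 2, 1, 4, 1],
                   [3, 3, 2, 2, 1, 1]])

# Seeded generator: the same seed yields the same picks on every run
rng = np.random.default_rng(0)

# For each column, drop the NaN padding and draw one of the remaining modes
picked = df.mode().apply(lambda col: rng.choice(col.dropna().to_numpy()))
print(picked)
```

Columns with a single mode always return that mode; only columns with ties depend on the seed.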
Example: Mixed-type data
import numpy as np
import pandas as pd
data = {"col1": ["foo", "bar", "baz", "foo", "bar", "foo", "bar", "baz"],
"col2": [10, 0, 0, 10, 10, 10, 0, 10],
"col3": [42., 14., 0.1, 1., 1., 4., 42., 14.],
"col4": [False, False, False, True, True, True, False, True],
"col5": [None, "abc", "abc", None, "def", "def", None, "abc"],
"col6": [1.2, None, 1.2, 2.3, None, 2.3, 1.2, 2.3] }
df = pd.DataFrame(data)
# col1 col2 col3 col4 col5 col6
# 0 foo 10 42.0 False None 1.2
# 1 bar 0 14.0 False abc NaN
# 2 baz 0 0.1 False abc 1.2
# 3 foo 10 1.0 True None 2.3
# 4 bar 10 1.0 True def NaN
# 5 foo 10 4.0 True def 2.3
# 6 bar 0 42.0 False None 1.2
# 7 baz 10 14.0 True abc 2.3
#
# dtype object int64 float64 bool object float64
modes = df.mode()
# col1 col2 col3 col4 col5 col6
# 0 bar 10.0 1.0 False abc 1.2
# 1 foo NaN 14.0 True NaN 2.3
# 2 NaN NaN 42.0 NaN NaN NaN
#
# dtype object float64 float64 object object float64
Note how the None values are excluded from the modes, how multiple modes are sorted, and that the dtypes of col2 and col4 have changed.
Finally, we can fix the typecast and pick the mode at random if there are multiple:
modes_fixed = modes.apply(lambda x: np.random.choice(x.dropna()))
modes_fixed = [dtype.type(m) for m, dtype in zip(modes_fixed, df.dtypes)]
# e.g. ['foo', 10, 14.0, False, 'abc', 2.3] (the result varies between runs)
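As an alternative sketch, the NaN padding (and with it the typecast) can be sidestepped by computing the modes column by column with pandas.Series.mode(), which returns only that column's own modes. The helper name pick_random_modes, its rng parameter and the seed are hypothetical and chosen for illustration only.

```python
import numpy as np
import pandas as pd

def pick_random_modes(df, rng=None):
    """Return one randomly chosen mode per column as a dict (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    picked = {}
    for col in df.columns:
        # Series.mode() drops NaN/None by default and is not padded,
        # so integer and boolean columns are not upcast
        candidates = df[col].mode().to_numpy()
        picked[col] = rng.choice(candidates)
    return picked

data = {"col1": ["foo", "bar", "baz", "foo", "bar", "foo", "bar", "baz"],
        "col2": [10, 0, 0, 10, 10, 10, 0, 10],
        "col3": [42., 14., 0.1, 1., 1., 4., 42., 14.],
        "col4": [False, False, False, True, True, True, False, True],
        "col5": [None, "abc", "abc", None, "def", "def", None, "abc"],
        "col6": [1.2, None, 1.2, 2.3, None, 2.3, 1.2, 2.3]}
df = pd.DataFrame(data)

result = pick_random_modes(df, rng=np.random.default_rng(0))
print(result)
```

Note that this sketch assumes every column has at least one non-missing value; a column consisting only of NaN has no mode, and the random draw would fail.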