How to find the median in a NumPy 2D array with a matching column

Question

From basic math I know that the median salary across all listed jobs is 40000, but how would I obtain that using NumPy?

E.g., find the median salary of all jobs listed, given:

1st column = salary

2nd column = no. of jobs advertised

```
x = np.array([
    [10000, 329],
    [20000, 329],
    [30000, 323],
    [40000, 310],
    [50000, 284],
    [60000, 232],
    [70000, 189],
    [80000, 130],
    [90000, 87],
    [100000, 71],
])
```

You're looking for a weighted median, where the second column holds the weights. This is not built into NumPy, but you can write a function as demonstrated [here](https://stackoverflow.com/a/55521559/1678467) and [here](https://stackoverflow.com/a/55622669/1678467). The second link is a more general solution for quantiles (the median is the 0.50 quantile). - tnknepp, Feb 23 '22 at 14:30
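
A minimal sketch of the weighted-quantile idea from the comment, assuming no interpolation between values is wanted. The name `weighted_quantile` and its `q` parameter are illustrative; they are not taken from the linked answers or from NumPy itself:

```python
import numpy as np


def weighted_quantile(values, weights, q=0.5):
    """Smallest value whose cumulative weight reaches a fraction q of the total.

    Hypothetical helper for illustration; no interpolation is performed.
    """
    values = np.asarray(values)
    weights = np.asarray(weights)

    # Sort by value so the cumulative weights follow the value order.
    order = np.argsort(values)
    values, weights = values[order], weights[order]

    cum_weights = np.cumsum(weights)
    # First index whose cumulative weight is >= q * (total weight).
    idx = np.searchsorted(cum_weights, q * cum_weights[-1], side="left")
    return values[idx]


x = np.array([
    [10000, 329], [20000, 329], [30000, 323], [40000, 310], [50000, 284],
    [60000, 232], [70000, 189], [80000, 130], [90000, 87], [100000, 71],
])
print(weighted_quantile(x[:, 0], x[:, 1]))  # 40000
```

When the target weight lands exactly on a bucket boundary, this sketch returns the lower value rather than averaging the two neighbours, so it can differ from a median that averages the two middle elements.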

Alexandru Dinu · Accepted Answer · 2022-08-03T11:42:41.863

You have a frequency table. You want the first value in `x[:, 0]` whose cumulative frequency covers the midpoint of the total count.
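
To make the cumulative-frequency idea concrete, here is a small walk-through on the question's data. The variable names (`lo_pos`, `hi_pos`, `i_lo`, `i_hi`) are chosen for illustration only:

```python
import numpy as np

x = np.array([
    [10000, 329], [20000, 329], [30000, 323], [40000, 310], [50000, 284],
    [60000, 232], [70000, 189], [80000, 130], [90000, 87], [100000, 71],
])

# Running total of jobs: 329, 658, 981, 1291, 1575, 1807, 1996, 2126, 2213, 2284
cf = np.cumsum(x[:, 1])
n = cf[-1]  # 2284 jobs in total (an even count)

# 0-based positions of the two middle jobs in the sorted, expanded list.
lo_pos, hi_pos = n // 2 - 1, n // 2  # 1141 and 1142

# A position p lies in the first bucket whose cumulative count exceeds p.
# np.argmax on a boolean array returns the index of the first True.
i_lo = np.argmax(cf > lo_pos)
i_hi = np.argmax(cf > hi_pos)

print(x[i_lo, 0], x[i_hi, 0])  # 40000 40000
```

Both middle positions fall in the fourth bucket (cumulative count 1291), so the median is 40000.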

You can use:

import numpy as np


def median_freq_table(freq_table: np.ndarray) -> float:
    """
    Find median of an array represented as a frequency table [[ val, freq ]].

    Rows are assumed to be sorted by value in ascending order.
    """
    values = freq_table[:, 0]
    freqs = freq_table[:, 1]

    # cumulative frequencies
    cf = np.cumsum(freqs)
    # total number of elements
    n = cf[-1]

    # boolean masks over the buckets for the two middle positions
    # (0-based n // 2 - 1 and n // 2); the first True in each mask
    # is the bucket that holds that position
    left = (n // 2 - 1) < cf
    right = (n // 2) < cf

    # odd length, or both middle positions fall in the same bucket
    if n % 2 == 1 or (left == right).all():
        return values[right][0]
    # even length: mean of the two buckets holding the middle positions
    # (indexing each mask separately skips zero-frequency buckets between them)
    else:
        return np.mean([values[left][0], values[right][0]])

Your input:

>>> xs = np.array(
    [
        [10000, 329],
        [20000, 329],
        [30000, 323],
        [40000, 310],
        [50000, 284],
        [60000, 232],
        [70000, 189],
        [80000, 130],
        [90000, 87],
        [100000, 71],
    ]
)
>>> median_freq_table(xs)
40000

Simple, even-length array:

>>> xs = np.array([[1, 3], [10, 3]])
>>> median_freq_table(xs)
5.5
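
One way to sanity-check results like these, assuming the frequencies are non-negative integers and the total count is small enough to fit in memory, is to expand the table with `np.repeat` and call `np.median` on the result. The helper name `median_by_expansion` is made up for this sketch:

```python
import numpy as np


def median_by_expansion(freq_table):
    """Expand [[value, freq]] rows into a flat array and take np.median.

    Hypothetical cross-check; memory use grows with the sum of the frequencies.
    """
    values = freq_table[:, 0]
    freqs = freq_table[:, 1]
    # np.repeat(values, freqs) repeats values[i] exactly freqs[i] times.
    return np.median(np.repeat(values, freqs))


jobs = np.array([
    [10000, 329], [20000, 329], [30000, 323], [40000, 310], [50000, 284],
    [60000, 232], [70000, 189], [80000, 130], [90000, 87], [100000, 71],
])

print(median_by_expansion(np.array([[1, 3], [10, 3]])))  # 5.5
print(median_by_expansion(jobs))  # 40000.0
```

The cumulative-frequency approach avoids materialising the expanded array, which matters when the counts are large.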
