From basic math I know that the median salary across all jobs listed is 40000, but how would I obtain that using NumPy?

E.g., find the median salary of all jobs listed, given:

  • 1st column = salary

  • 2nd column = no. of jobs advertised

```
x = np.array([
    [10000, 329],
    [20000, 329],
    [30000, 323],
    [40000, 310],
    [50000, 284],
    [60000, 232],
    [70000, 189],
    [80000, 130],
    [90000, 87],
    [100000, 71],
])
```

Dr Pi
  • You're looking for a weighted median, where the second column holds the weights. This is not built into NumPy, but you can write a function as demonstrated [here](https://stackoverflow.com/a/55521559/1678467) and [here](https://stackoverflow.com/a/55622669/1678467). The second link is a more general solution for quantiles (the median is the 0.50 quantile). – tnknepp Feb 23 '22 at 14:30

1 Answer

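You have a frequency table. The median is the first value in x[:, 0] whose cumulative frequency passes the midpoint of the total count. When the total count is even and the two middle positions fall in different buckets, the median is the mean of those two bucket values.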
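To make the idea concrete, here is a minimal sketch of the cumulative-frequency lookup on the table from the question. It uses `np.searchsorted`, which is a substitution for illustration rather than the approach taken in the function further down, and the names `mid` and `idx` are made up for this example. It assumes the rows are sorted by salary.

```python
import numpy as np

x = np.array([
    [10000, 329], [20000, 329], [30000, 323], [40000, 310], [50000, 284],
    [60000, 232], [70000, 189], [80000, 130], [90000, 87], [100000, 71],
])
values, freqs = x[:, 0], x[:, 1]

# running total of jobs up to and including each salary
cf = np.cumsum(freqs)
# total number of jobs
n = cf[-1]

# 0-based position of the upper-middle element in the expanded, sorted list
mid = n // 2
# index of the first bucket whose cumulative count exceeds that position
idx = np.searchsorted(cf, mid, side="right")

print(cf)           # [ 329  658  981 1291 1575 1807 1996 2126 2213 2284]
print(n, mid)       # 2284 1142
print(values[idx])  # 40000
```

Both middle positions (1141 and 1142) land in the fourth bucket here, so no averaging is needed for this particular table.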

You can use:

import numpy as np


def median_freq_table(freq_table: np.ndarray) -> float:
    """
    Find median of an array represented as a frequency table [[ val, freq ]].

    Assumes the rows are sorted by value in ascending order.
    """
    values = freq_table[:, 0]
    freqs = freq_table[:, 1]

    # cumulative frequencies
    cf = np.cumsum(freqs)
    # total number of elements
    n = cf[-1]

    # boolean masks selecting the buckets at or after the two middle
    # positions (0-based indices n // 2 - 1 and n // 2 of the expanded array)
    left = (n // 2 - 1) < cf
    right = (n // 2) < cf

    # odd length, or both middle positions fall in the same bucket:
    # the median is that bucket's value
    if n % 2 == 1 or (left == right).all():
        return values[right][0]
    # even length with the middle positions in different buckets:
    # average the two bucket values (also correct when zero-frequency
    # rows sit between them)
    return np.mean([values[left][0], values[right][0]])

Your input:

>>> xs = np.array(
    [
        [10000, 329],
        [20000, 329],
        [30000, 323],
        [40000, 310],
        [50000, 284],
        [60000, 232],
        [70000, 189],
        [80000, 130],
        [90000, 87],
        [100000, 71],
    ]
)
>>> median_freq_table(xs)
40000

Simple, even-length array:

>>> xs = np.array([[1, 3], [10, 3]])
>>> median_freq_table(xs)
5.5
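
If the table is small enough to expand in memory, one way to sanity-check results like the ones above is to rebuild the raw observations with `np.repeat` and hand them to `np.median`. This is only a cross-check sketch, not part of the function above; the helper name `median_by_expansion` is hypothetical, and it assumes the frequencies are non-negative integers.

```python
import numpy as np


def median_by_expansion(freq_table: np.ndarray) -> float:
    """Expand [[val, freq]] rows into raw observations and take the median."""
    values = freq_table[:, 0]
    freqs = freq_table[:, 1]
    # repeat each value `freq` times, e.g. [[1, 3], [10, 3]] -> [1, 1, 1, 10, 10, 10]
    expanded = np.repeat(values, freqs)
    return float(np.median(expanded))


salaries = np.array([
    [10000, 329], [20000, 329], [30000, 323], [40000, 310], [50000, 284],
    [60000, 232], [70000, 189], [80000, 130], [90000, 87], [100000, 71],
])

print(median_by_expansion(salaries))                     # 40000.0
print(median_by_expansion(np.array([[1, 3], [10, 3]])))  # 5.5
```

The expansion needs memory proportional to the total count, so the cumulative-frequency approach is the better fit for large tables.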
Alexandru Dinu