Finding quantiles from np.array of different length lists

Question

I am trying to efficiently calculate the Interquartile Range, IQR, of some variable-length histogram data. I have the data in a list of lists. Each inner list is an individual histogram. Most of these histograms have a length of 100, but the length can vary between 50 -- 150 ints long.

Sample data:

list_of_hists = [
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28]
]

I am currently calculating the IQRs using a simple for loop:

list_of_iqrs = []
for hist in list_of_hists:
    iqr = np.quantile(hist, 0.75, interpolation="linear") - np.quantile(
        hist, 0.25, interpolation="linear"
    )
    list_of_iqrs.append(iqr)

Expected results for the above data:

list_of_iqrs = [10.0, 10.5, 11.5, 2.0]

Given that this list of hists is ~10**6 elements long, I am hoping to find a way to do this using an array calculation. Unfortunately, when I try to turn this into an array, I just get an array of lists:

array([
       list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42]),
       list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42]),
       list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43]),
       list([10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28])
])

and the quantile calculation doesn't work like I would expect.

How can I turn this list of hists into an array and find the IQR?

Edit: Another solution seems to be to append onto each hist and make them all the same length, and then turn that list of hists into an array:

list_of_hists_ = [hist + [None]*(len(max(list_of_hists, key=len))-len(hist)) for hist in list_of_hists]
np.array(list_of_hists_)

but this is very slow. Maybe I've already found the fastest way to do this?

Padding arrays to a common size has been discussed in previous SO, but I wonder that's even useful? In your example you padded with `None`, which would produce an object dtype array. Can you pad with something that doesn't mess up the `quantile` result? — hpaulj, Dec 13 '21 at 07:57
It's a good suggestion. Padding with `None` does indeed mess up `quantile`. If I pad with `np.NaN` I get a return of `nan` from `quantile`. I found [`np.nanquantile`](https://numpy.org/doc/stable/reference/generated/numpy.nanquantile.html#numpy.nanquantile) can do it, but it's slow compared with my original solution.. — Wesley Cheek, Dec 13 '21 at 08:08
The code for `nanquantile` can be studied in `np.lib.nanfunctions.py`. I don't know what's involved. For some `nan...` functions it's just a matter of removing the `nan` values or replacing them with something harmless like 0. But for quantile/percentile it's got to be more complicated. — hpaulj, Dec 13 '21 at 08:15

score 1 · Accepted Answer · answered Dec 13 '21 at 08:18

Since the data is sorted (histogram) you can utilize this characteristic for calculating the IQR in a more efficient way. It's enough to get the difference between the median of each half separately.

import numpy as np
from time import time  
list_of_hists = [
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],

]
list_of_hists = np.array(list_of_hists)

def sortedIQR(data):
    pivot = len(data)//2
    # First quartile (Q1)
    Q1 = np.median(data[:pivot])
    # Third quartile (Q3)
    Q3 = np.median(data[pivot:])
    # Interquartile range (IQR)
    IQR = Q3 - Q1
    return IQR

def simpleIQR(hist):
    iqr = np.quantile(hist, 0.75, interpolation="linear") - np.quantile(
        hist, 0.25, interpolation="linear"
    )
    return iqr

start = time()
answers = []
for idx, item in enumerate(list_of_hists):
    answers.append(simpleIQR(item))
end = time()
print('Elapsed Time for Simple IQR: ', round(end-start, 5))
print(answers)
answers = []
start = time()
for idx, item in enumerate(list_of_hists):
    answers.append(sortedIQR(item))
end = time()
print('Elapsed Time for Sorted IQR: ', round(end-start, 5))
print(answers)

output:

Elapsed Time for Simple IQR:  0.004
[10.0, 10.5, 11.5, 2.0, 10.0, 10.5, 11.5, 2.0, 10.0, 10.5, 11.5, 2.0]
Elapsed Time for Sorted IQR:  0.001
[10.0, 10.5, 12.0, 2.0, 10.0, 10.5, 12.0, 2.0, 10.0, 10.5, 12.0, 2.0]

Thanks for this clever solution. It does indeed work. Compared with my previous time of 47.58 secs, your `sortedIQR` function can do it in 13.23 secs. I'll wait and see if someone can offer a fast solution that doesn't depend on the hists being sorted. — Wesley Cheek, Dec 13 '21 at 08:26

Finding quantiles from np.array of different length lists

1 Answers1