
I would like to calculate the median row by row in a dataframe of more than 500,000 rows. At the moment I'm using np.median because numpy is optimized to run on a single core. It's still very slow, and I'd like to find a way to parallelize the calculation.

Specifically, I have N tables of size 13 x 500,000 and for each table I want to add the columns Q1, Q3 and median so that for each row the median column contains the median of the row. So I have to calculate N * 500,000 median values.

I tried with numexpr but it doesn't seem possible.

EDIT: In fact I also need Q1 and Q3, so I can't use the statistics module, which doesn't allow calculating quartiles. Here is how I calculate the median at the moment:

    import numpy as np

    # One percentile call computes Q1, the median and Q3 for every row;
    # transpose so each row of q holds that row's three quantiles.
    q = np.transpose(np.percentile(data[row_array], [25, 50, 75], axis=1))
    data['Q1_' + family] = q[:, 0]
    data['MEDIAN_' + family] = q[:, 1]
    data['Q3_' + family] = q[:, 2]

EDIT 2: I solved my problem by using the median-of-medians algorithm, as proposed below.

3 Answers


If a (close) approximation of the median is OK for your purposes, you should consider computing a median of medians, which is a divide and conquer strategy that can be executed in parallel. In principle, MoM has O(n) complexity for serial execution, approaching O(1) for parallel execution on massively parallel systems.

See this Wikipedia entry for a description and pseudo-code. See also this question on Stack Overflow and the discussion of the code there, and this arXiv paper for a GPU implementation.
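A minimal sketch of the divide-and-conquer idea in NumPy; the `median_of_medians` helper, the chunk count, and the random test data are illustrative, not taken from the linked sources:

    import numpy as np

    def median_of_medians(values, n_chunks=8):
        # Split the data into independent chunks, take each chunk's median,
        # then return the median of those medians. The per-chunk pass has
        # no data dependencies, so it can be farmed out to threads,
        # processes, or a GPU kernel.
        chunks = np.array_split(np.asarray(values), n_chunks)
        return np.median([np.median(c) for c in chunks])

    rng = np.random.default_rng(0)
    x = rng.random(500_000)
    print(median_of_medians(x), np.median(x))  # close, but not identical

The result is an approximation; the error shrinks as the chunks get larger.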

MPA

Courtesy of @dahhiya_boy

You can use median() from the statistics module:

    import statistics

    items = [7, 1, 3, 9, 5]   # example data
    statistics.median(items)  # -> 5

You can calculate Q1 by taking the median of the values between min() and the median, and you can calculate Q3 by taking the median of the values between the median and max(). If you find this messy, just define a quartile_median() function that returns Q1, Q2, Q3.
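A hypothetical `quartile_median()` along those lines, sorting once and taking the medians of the two halves (Tukey's hinges, which can differ slightly from np.percentile's interpolated quartiles):

    import statistics

    def quartile_median(items):
        # Sort once; Q2 is the overall median, and Q1/Q3 are the medians
        # of the lower and upper halves (excluding the middle element
        # when the length is odd).
        s = sorted(items)
        half = len(s) // 2
        q1 = statistics.median(s[:half])
        q2 = statistics.median(s)
        q3 = statistics.median(s[half + len(s) % 2:])
        return q1, q2, q3

    print(quartile_median([1, 3, 5, 7, 9, 11, 13]))  # (3, 7, 11)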

Alec
  • Look at the edited part of the question; thank you for your answer – Pierre Carceller Mar 26 '19 at 09:51
  • that's what @dahhiya_boy proposed, but as I said in my answer, I don't think it's efficient in terms of computing time. (Correct me if I'm wrong) – Pierre Carceller Mar 26 '19 at 10:00
  • I'm not an expert on the subject, but I'm pretty sure that's how a quartile function calculates quartiles anyway. A quartile is defined as the median of the values between the median and an extremum; I don't think there's a faster way to calculate it – Alec Mar 26 '19 at 10:02
  • you are perfectly right, but to calculate the median you have to sort the table. With numpy the table is sorted once; with the statistics module there are 3 function calls, so the table is sorted 3 times. That's why I think it's actually going to slow down the program. – Pierre Carceller Mar 26 '19 at 10:09
  • That's a good point, but considering `sort()` time is nlog2(n), it shouldn't be too much of an issue – Alec Mar 26 '19 at 10:18
  • @PierreCarceller I don't think `numpy` is slow; I tried **500,000** random int values to calculate the same statistics (median, Q1, Q3), and the process always ends in **0.2s** or **0.3s** – dahiya_boy Mar 26 '19 at 10:19
  • That's not exactly my problem. I have N tables of size 13 x 500,000, and for each table I want to add the columns Q1, Q3 and median so that for each row the median column contains the median of the row. So I have to calculate N * 500,000 median values – Pierre Carceller Mar 26 '19 at 10:31

From what I understand, you want to compute the quantiles row by row. You can simply transpose your dataframe and then apply pandas.DataFrame.quantile. Not sure this is optimal, though.

    q = data.T.quantile([0.25, 0.50, 0.75])

If you have IPython active, you can prefix the line with the %time magic to check the run time:

    %time q = data.T.quantile([0.25, 0.50, 0.75])

This returns Wall time: 0 ns for me.
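To get the same column layout as in the question, the result can be transposed back and assigned as new columns. A sketch; the Q1/MEDIAN/Q3 names mirror the question's, and `data` is assumed to hold only numeric value columns:

    # quantile() on the transposed frame yields one column per original
    # row; transpose again so each quantile level becomes a column
    # aligned with data's index.
    q = data.T.quantile([0.25, 0.50, 0.75]).T
    data['Q1'] = q[0.25]
    data['MEDIAN'] = q[0.50]
    data['Q3'] = q[0.75]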

CAPSLOCK