
I would like to calculate the median row by row in a dataframe of more than 500,000 rows. At the moment I'm using np.median because numpy is optimized to run on a single core. It's still very slow, and I'd like to find a way to parallelize the calculation.

Specifically, I have N tables of size 13 x 500,000 and for each table I want to add the columns Q1, Q3 and median so that for each row the median column contains the median of the row. So I have to calculate N * 500,000 median values.

I tried with numexpr but it doesn't seem possible.

EDIT: In fact I also need Q1 and Q3, so I can't use the statistics module, which doesn't allow calculating quartiles. Here is how I calculate the median at the moment:

    import numpy as np

    # One percentile call computes Q1, the median and Q3 for every row;
    # transpose so each row of q holds that row's three quantiles.
    q = np.transpose(np.percentile(data[row_array], [25, 50, 75], axis=1))
    data['Q1_' + family] = q[:, 0]
    data['MEDIAN_' + family] = q[:, 1]
    data['Q3_' + family] = q[:, 2]

EDIT 2: I solved my problem by using the median-of-medians algorithm, as proposed below.

3 Answers


If a (close) approximation of the median is OK for your purposes, you should consider computing a median of medians, which is a divide and conquer strategy that can be executed in parallel. In principle, MoM has O(n) complexity for serial execution, approaching O(1) for parallel execution on massively parallel systems.

See this Wikipedia entry for a description and pseudo-code. See also this question on Stack Overflow and the discussion of the code there, and this arXiv paper for a GPU implementation.
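A minimal sketch of the divide-and-conquer idea in NumPy; the `median_of_medians` helper, the chunk count, and the random test data are illustrative, not taken from the linked sources:

    import numpy as np

    def median_of_medians(values, n_chunks=8):
        # Split the data into independent chunks, take each chunk's median,
        # then return the median of those medians. The per-chunk pass has
        # no data dependencies, so it can be farmed out to threads,
        # processes, or a GPU kernel.
        chunks = np.array_split(np.asarray(values), n_chunks)
        return np.median([np.median(c) for c in chunks])

    rng = np.random.default_rng(0)
    x = rng.random(500_000)
    print(median_of_medians(x), np.median(x))  # close, but not identical

The result is an approximation; the error shrinks as the chunks get larger.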

MPA

Courtesy of @dahhiya_boy

You can use median() from the statistics module:

    import statistics

    items = [7, 1, 3, 9, 5]   # example data
    statistics.median(items)  # -> 5

You can calculate Q1 by taking the median of the values between min() and the median, and you can calculate Q3 by taking the median of the values between the median and max(). If you find this messy, just define a quartile_median() function that returns Q1, Q2, Q3.
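A hypothetical `quartile_median()` along those lines, sorting once and taking the medians of the two halves (Tukey's hinges, which can differ slightly from np.percentile's interpolated quartiles):

    import statistics

    def quartile_median(items):
        # Sort once; Q2 is the overall median, and Q1/Q3 are the medians
        # of the lower and upper halves (excluding the middle element
        # when the length is odd).
        s = sorted(items)
        half = len(s) // 2
        q1 = statistics.median(s[:half])
        q2 = statistics.median(s)
        q3 = statistics.median(s[half + len(s) % 2:])
        return q1, q2, q3

    print(quartile_median([1, 3, 5, 7, 9, 11, 13]))  # (3, 7, 11)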

Alec
  • Look at the edited part of the question; thank you for your answer – Pierre Carceller Mar 26 '19 at 09:51
  • that's what @dahhiya_boy proposed, but as I said in my answer, I don't think it's efficient in terms of computing time. (Correct me if I'm wrong) – Pierre Carceller Mar 26 '19 at 10:00
  • I'm not an expert on the subject, but I'm pretty sure that's how a quartile function calculates quartiles anyway. A quartile is defined as the median of the values between the median and an extremum; I don't think there's a faster way to calculate it – Alec Mar 26 '19 at 10:02
  • you are perfectly right, but to calculate the median you have to sort the table. With numpy the table is sorted once; with the statistics module there are 3 function calls, so the table is sorted 3 times. That's why I think it's actually going to slow down the program. – Pierre Carceller Mar 26 '19 at 10:09
  • That's a good point, but considering `sort()` time is nlog2(n), it shouldn't be too much of an issue – Alec Mar 26 '19 at 10:18
  • @PierreCarceller I don't think `numpy` is slow; I tried **500,000** random int values to calculate the same statistics (median, Q1, Q3), and the process always ends in **0.2s** or **0.3s** – dahiya_boy Mar 26 '19 at 10:19
  • That's not exactly my problem. I have N tables of size 13 x 500,000, and for each table I want to add the columns Q1, Q3 and median so that for each row the median column contains the median of the row. So I have to calculate N * 500,000 median values – Pierre Carceller Mar 26 '19 at 10:31

From what I understand, you want to compute the quantiles row by row. You can simply transpose your dataframe and then apply pandas.DataFrame.quantile. Not sure this is optimal, though.

    q = data.T.quantile([0.25, 0.50, 0.75])

If you have IPython active, you can prefix the line with the %time magic to check the run time:

    %time q = data.T.quantile([0.25, 0.50, 0.75])

This returns Wall time: 0 ns for me.
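To get the same column layout as in the question, the result can be transposed back and assigned as new columns. A sketch; the Q1/MEDIAN/Q3 names mirror the question's, and `data` is assumed to hold only numeric value columns:

    # quantile() on the transposed frame yields one column per original
    # row; transpose again so each quantile level becomes a column
    # aligned with data's index.
    q = data.T.quantile([0.25, 0.50, 0.75]).T
    data['Q1'] = q[0.25]
    data['MEDIAN'] = q[0.50]
    data['Q3'] = q[0.75]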

CAPSLOCK