I would like to calculate the median row by row in a dataframe of more than 500,000 rows. For the moment I'm using np.median,
because numpy is optimized to run on a single core. It's still very slow, and I'd like to find a way to parallelize the calculation.
Specifically, I have N tables of size 13 x 500,000, and for each table I want to add the columns Q1, Q3 and median, so that for each row the median column contains the median of that row. So I have to calculate N * 500,000 median values.
I tried with numexpr, but it doesn't seem possible.
EDIT: In fact I also need Q1 and Q3, so I can't use the statistics module, which doesn't allow calculating quartiles. Here is how I calculate the median for the moment:
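# row_array selects the columns to summarize; q ends up with one row per data row
# and three columns: Q1, median, Q3
q = np.transpose(np.percentile(data[row_array], [25, 50, 75], axis=1))
data['Q1_' + family] = q[:, 0]
data['MEDIAN_' + family] = q[:, 1]
data['Q3_' + family] = q[:, 2]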
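For reference, one way the parallelization asked about above could look: split the rows into chunks and run the same np.percentile call on each chunk in a thread pool. This is only a sketch. The names chunk_quartiles, parallel_quartiles and n_chunks are made up for illustration, the data is synthetic, and whether threads give a real speedup depends on how much of the work NumPy does with the GIL released, so it needs to be timed on the real tables.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunk_quartiles(block):
    # block: 2-D array (rows x columns).
    # Returns a (rows x 3) array holding Q1, median and Q3 of each row.
    return np.percentile(block, [25, 50, 75], axis=1).T


def parallel_quartiles(values, n_chunks=4):
    # Hypothetical helper: split the rows into n_chunks blocks, compute the
    # quartiles of each block in its own thread, then stack the results
    # back together in the original row order.
    blocks = np.array_split(np.asarray(values), n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(chunk_quartiles, blocks))
    return np.vstack(parts)


# Synthetic stand-in for one table: 1,000 rows and 13 value columns.
rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 13))

q = parallel_quartiles(values, n_chunks=4)
```

With a dataframe, values would be data[row_array].to_numpy(), and q[:, 0], q[:, 1] and q[:, 2] can be assigned to the Q1, MEDIAN and Q3 columns exactly as in the snippet above. If threads don't help, ProcessPoolExecutor from the same module has the same map interface, at the cost of copying each chunk to a worker process.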
EDIT 2: I solved my problem by using the median of medians algorithm, as proposed below.
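The implementation used is not shown here, so the following is only a generic sketch of the textbook median of medians selection algorithm, not necessarily the exact code that solved the problem. The function name median_of_medians_select and the sample row are made up for illustration. It returns the k-th smallest value of a sequence, so with 13 values per row the median is k = 6 (0-based).

```python
def median_of_medians_select(values, k):
    """Return the k-th smallest element (0-based) of values."""
    values = list(values)
    # Small inputs: sorting directly is the base case of the recursion.
    if len(values) <= 5:
        return sorted(values)[k]

    # Split into groups of 5 and take the median of each group.
    groups = [values[i:i + 5] for i in range(0, len(values), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]

    # The pivot is the median of those group medians, found recursively.
    pivot = median_of_medians_select(medians, len(medians) // 2)

    # Partition around the pivot; values equal to the pivot are only counted.
    lows = [v for v in values if v < pivot]
    highs = [v for v in values if v > pivot]
    n_equal = len(values) - len(lows) - len(highs)

    if k < len(lows):
        return median_of_medians_select(lows, k)
    if k < len(lows) + n_equal:
        return pivot
    return median_of_medians_select(highs, k - len(lows) - n_equal)


# Made-up row of 13 values; index 6 of the sorted row is the median.
row = [7, 1, 9, 3, 12, 5, 8, 2, 11, 4, 10, 6, 13]
median = median_of_medians_select(row, 6)
```

Note that when the position 0.25 * (n - 1) or 0.75 * (n - 1) is not a whole number, np.percentile interpolates between two neighbouring order statistics by default, so a selection-based Q1 or Q3 would need both neighbours. For n = 13 the positions are 3, 6 and 9, so no interpolation is needed.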