I'm writing a function to find the median of a set of values. The data is presented as a vector of the unique values (call them 'values') and a vector of their frequencies ('freqs'). Frequently the frequencies are very high, so pasting them out uses an exorbitant amount of memory. I have a slow R implementation and it is the major bottleneck in my code, so I am writing a custom Rcpp function for use in an R/Bioconductor package. Bioconductor's site suggests not using C++11, so that is an issue for me.
My problem lies in trying to sort the two vectors together, according to the order of the values. In R, we can just use the order() function. I cannot seem to get this to work, despite following the advice on this question: C++ sorting and keeping track of indexes
The following lines are where the problem lies:
// sort vector based on order of values
IntegerVector idx_ord = std::sort(idx.begin(), idx.end(),
bool (int i1, int i2) {return values[i1] < values[i2];});
And here is the full function, for anyone's interest. Any further tips would be greatly appreciated:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double median_freq(NumericVector values, IntegerVector freqs) {
int len = freqs.size();
if (any(freqs!=0)){
int med = 0;
return med;
}
// filter out the zeros pre-sorting
IntegerVector non_zeros;
for (int i = 0; i < len; i++){
if(freqs[i] != 0){
non_zeros.push_back(i);
}
}
freqs = freqs[non_zeros];
values = values[non_zeros];
// find the order of values
// create integer vector of indices
IntegerVector idx(len);
for (int i = 0; i < len; ++i) idx[i] = i;
// sort vector based on order of values
IntegerVector idx_ord = std::sort(idx.begin(), idx.end(),
bool (int i1, int i2) {return values[i1] < values[i2];});
//apply to freqs and values
freqs = freqs[idx_ord];
values=values[idx_ord];
IntegerVector cum_freqs(len);
cum_freqs[0] = freqs[0];
for (int i = 1; i < len; ++i) cum_freqs[i] = freqs[i] + cum_freqs[i-1];
int total_freqs = cum_freqs[len-1];
// split into odd and even frequencies and calculate the median
if (total_freqs % 2 == 1) {
int med_ind = (total_freqs + 1)/2 - 1; // C++ indexes from 0
int i = 0;
while ((i < len) && cum_freqs[i] < med_ind){
i++;
}
double ret = values[i];
return ret;
} else {
int med_ind_1 = total_freqs/2 - 1; // C++ indexes from 0
int med_ind_2 = med_ind_1 + 1; // C++ indexes from 0
int i = 0;
while ((i < len) && cum_freqs[i] < med_ind_1){
i++;
}
double ret_1 = values[i];
i = 0;
while ((i < len) && cum_freqs[i] < med_ind_2){
i++;
}
double ret_2 = values[i];
double ret = (ret_1 + ret_2)/2;
return ret;
}
}
For anyone using the RUnit testing framework, here are some basic unit tests:
test_median_freq <- function(){
checkEquals(median_freq(1:10,1:10),7)
checkEquals(median_freq(1:10,rep(1,10)),5.5)
checkEquals(median_freq(2:6,c(1,2,1,45,2)),5)
}
Thanks!