C++ fastest way for calculating rolling median

Question

I am trying to solve a hackerrank problem on "Fraudulent Notification". While I passes some of the test cases, i get time limit exception as my algorithm is not efficient enough. Note i am trying to do it in c++ and using std::multimap as element would be in sorted order and hence insertion, deletion etc. from data structure would be relatively fast. Link to the hackerank problem is : https://www.hackerrank.com/challenges/fraudulent-activity-notifications/problem. Moreover one of the test cases against which i get time limit exceeded is at the link https://hr-testcases-us-east-1.s3.amazonaws.com/26114/input01.txt?AWSAccessKeyId=AKIAR6O7GJNX5DNFO3PV&Expires=1681754385&Signature=O64%2FYgms0Psub8VZQQpvlGYHEiw%3D&response-content-type=text%2Fplain.

Note i did see some of the question like C++ Efficiently Calculating a Running Median in SO here, but my understanding to some of these order statistic tree and advanced template is very limited, so I thought if there is something else or a more easier and efficient approach to the problem. Thanks in advance.

int activityNotifications(vector<int> expenditure, int d) {
    int total_warning=0;
    //std::vector<double> median_arr;
    std::multiset<double> arr;
    int total_days=expenditure.size();
    for(int i=0;i<d;++i)
    {
        arr.insert(expenditure[i]);
    }

    for(int i=d,j=0;i<total_days;++i,++j)
    {
        double median;
        double a = *next(arr.begin(),d/2);
        double b = *next(arr.begin(),d/2 + 1);
        if(d & 1)
        {
            median=a;
            //median_arr.push_back(a);
            //std::cout <<"Median is b:" << median << endl;
        }
        else 
        {
            median=(double(a+b))/2;
            //median_arr.push_back((a+b)/2);
            //std::cout <<"Median is a+b :" << median << endl;
        }
        arr.insert(expenditure[i]);
        arr.erase(arr.find(expenditure[j]));
        //std::cout <<"Median * 2 :" << median*2 << endl;
        //std::cout <<"Expenditure :" << expenditure[i] << endl;
        if(expenditure[i] >= (median*2) )
        {
            ++total_warning;
        }
    }
    return total_warning;
}

Pretty close to a duplicate of: https://stackoverflow.com/q/7842347/179910 — Jerry Coffin, Apr 17 '23 at 16:35
The top answer on the question you linked doesn't have any "statistic tree" or advanced templates, it has a max heap and a min heap. — Useless, Apr 17 '23 at 16:45
@PepijnKramer It is more for preparing for algorithm interview rather than learning cpp. But yes completely agree with you. I think there are many good algos book that i can refer to. — Invictus, Apr 17 '23 at 20:01
@Useless I tried using creating a min and max heap with max difference in sizes between them as 1. But when I need to remove/erase the expense from earliest day and insert new expenditure i am not sure how to erase an element from a priority queue without significant performance. One way is to create a new heap by iterating and discarding element to remove. There is no easy way to erase from a priority queue ? — Invictus, Apr 18 '23 at 13:11
Hackerrank's editor is unusable on my device so I won't try it myself, but I suggest you try it with a multiset [like this](https://leetcode.com/problems/sliding-window-median/discuss/96340/O(n-log-k)-C++-using-multiset-and-updating-middle-iterator). That is, keep an iterator pointing at the middle and only adjust it slightly as needed, instead of always using `next`with large step numbers. I'd like to see that solution. — Kelly Bundy, Apr 27 '23 at 22:17
(Because I think your `next` calls take Θ(d) time, whereas those alight adjustments only take Θ(1) time.) — Kelly Bundy, Apr 27 '23 at 22:30

Matt Timmermans · Answer 1 · 2023-04-18T13:46:27.800

The constraints on the problem indicate the strategies you can use and how long you get to run:

N <= about 10^5 usually means that quadratic time is acceptable. You could use quickselect on the preceding d items to find the median.
Each expenditure element <= 200 means that you can use quick and dirty replacements for an order statistic tree, to solve the problem in O(n log d) time. For example, maintain 4 arrays that keep track of the number of elements with each possible value, the number of elements in each group of 4 values, the number of elements in each group of 16 values, and the number of elements in each group of 64 values.

Here's an example of how to do the order statistic thing:

std::vector<int> h0(201,0), h1(26,0), h2(13,0), h3(4,0);

// add a value v, given v<=200
h0[v] += 1;
h1[v>>2] += 1;
h2[v>>4] += 1;
h3[v>>6] += 1;

// remove a value v, given v<=200
h0[v] -= 1;
h1[v>>2] -= 1;
h2[v>>4] -= 1;
h3[v>>6] -= 1;

// count how many values are less than v, given v<=200
int count=0;
while(v&3) count+=h0[--v];
v>>=2;
while(v&3) count+=h1[--v];
v>>=2;
while(v&3) count+=h2[--v];
v>>=2;
while(v) count+=h3[--v];

// find the ith largest value
v=0;
for(int n = h3[v]; n<=i; i-=n, ++v);
v<<=2;
for(int n = h2[v]; n<=i; i-=n, ++v);
v<<=2;
for(int n = h1[v]; n<=i; i-=n, ++v);
v<<=2;
for(int n = h0[v]; n<=i; i-=n, ++v);

Did not quiet get your second point, is it possible to explain that a bit ? Thanks — Invictus, Apr 18 '23 at 13:12

Invictus · Accepted Answer · 2023-04-18T14:58:32.647

I tried to use the heap approach with 2 multisets. Data for trailing window hence got divided into 2 parts with with either d/2 elements each or d/2 and d/2 +1 element. I used multiset as elements to be stored ( expenditure for a day in this case can be same). This reduced my sorting time as I needed to sort at max d/2 + 1 elements, with most elements already sorted. So it's time complexity would be ideally same as inserting a new element in multiset.

void insert_min_max_heap(std::multiset<int>& max_heap,std::multiset<int>& min_heap, int value)
{
        if(min_heap.size() == max_heap.size())
        {
            if (min_heap.size()==0)
            {
                min_heap.insert(value); // If both heap 0 element can be inserted in any of the heap. 
            }
            else if(value <= *(min_heap.begin())) // If sizes are same and new element to be inserted is < first element of the min heap then i would insert it on max heap otherwise in min heap.
                max_heap.insert(value); 
            else
                min_heap.insert(value);   
        }
        else // If heap size are not same idea is to keep a maximum difference of size 1  and hence depending on incoming element value we might or might not need to move first or last element of min_heap or max_heap to the other side.
        {
            if(min_heap.size() > max_heap.size())
            {
                int temp=*(min_heap.begin());
                if( value > *(min_heap.begin() ))
                {
                    auto it = min_heap.find(*(min_heap.begin()));
                    min_heap.erase(it);
                    min_heap.insert(value);
                    max_heap.insert(temp);
                }
                else
                {
                    max_heap.insert(value);
                }
            }
            //Below could be else if instead of if but then we would need else block. Since comparison operator has made check for == above and > then above, so insertion in > check part in above woul max make min_heap same as max_heap hence below criteria would not satisfy if above if condition is already satisfied.
            if(min_heap.size() < max_heap.size())
            {
                int temp=(*max_heap.rbegin());
                if( value < (*max_heap.rbegin()))
                {
                    auto it = max_heap.find(*(max_heap.rbegin()));
                    max_heap.erase(it);
                    max_heap.insert(value);
                    min_heap.insert(temp);
                }
                else
                {
                    min_heap.insert(value);
                }
            }
        }
    
}
int activityNotifications(vector<int> expenditure, int d) {
    int total_days = expenditure.size();
    int total_warning=0;
    std::multiset<int> max_heap;
    std::multiset<int> min_heap;
    double median=0.0;
    
    for(int i=0;i<d;++i)
    {
        insert_min_max_heap(max_heap,min_heap,expenditure[i]);
    }
    
    for(int i=d,j=0;i<total_days;++i,++j)
    {
        if(d & 1)
        {
            if(max_heap.size() > min_heap.size()) 
                median=(*max_heap.rbegin()); // we are taking last element of multiset from max_heap if max_heap > min_head {1,2,3} and {4,5} median would be 3 which is last element of max_heap. 
            else
                median=*min_heap.begin(); // we are taking first element of multiset from min_heap if min_heap > max_heap {1,2} and {3,4,5} median would be 3 which is first element of min_heap
        }
        else 
        {
            median=(*max_heap.rbegin() + *min_heap.begin())/2.0; // In case both min and max heap has same number of elements we need last element from max heap and first element from min heap and take mean of it.
        }
        
        if(*min_heap.begin() <= expenditure[j])
        {
            min_heap.erase(min_heap.find(expenditure[j])); // erase the element which is going out of trailing window size ( d ) if present in min_heap. 
        }
        else 
        {
              max_heap.erase(max_heap.find(expenditure[j])); // erase the element which is going out of trailing window size ( d ) if present in max_heap. 
        }
        
        if(expenditure[i] >= (median*2) )
        {
            ++total_warning; // Add count of warning if expenditure >= two times median.
        }
        insert_min_max_heap(max_heap,min_heap,expenditure[i]); // Add the latest expense to the heap for calculation of new median. 
    }
    return total_warning;
}

C++ fastest way for calculating rolling median

2 Answers2

Linked