
Background

I've written a program designed to calculate the odds of each outcome associated with a die roll, or combinations of die rolls. Specifically, when handling rolls like "add together two six-sided dice" (a.k.a. the Catan roll), probabilities are calculated by saving mappings between individual outcomes and the number of possible rolls that can produce each outcome. In this example, the data structure is a TreeMap<Integer, BigInteger>, which takes this form:

{
[2, 1],
[3, 2],
[4, 3],
[5, 4],
[6, 5],
[7, 6],
[8, 5],
[9, 4],
[10, 3],
[11, 2],
[12, 1]
}

Before anyone asks, the use of BigInteger in this context is NOT overkill: the program is designed to handle any possible roll that might be presented to it, and rolls like 100d6 (add together the rolls of 100 six-sided dice) very quickly produce very large numbers that I don't want approximated as a double.
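The growth described above is easy to check directly: the total number of equally likely rolls for 100d6 is 6^100, a 78-digit number, far beyond what a long can hold or a double can represent exactly. A quick sketch (the class name is mine, for illustration only):

```java
import java.math.BigInteger;

public class TrialCountDemo {
    public static void main(String[] args) {
        // Total number of equally likely rolls for 100d6 is 6^100.
        // A long tops out around 9.2e18; a double has only 53 bits of mantissa.
        BigInteger total = BigInteger.valueOf(6).pow(100);
        System.out.println(total.toString().length()); // prints 78 (decimal digits)
    }
}
```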

As part of the interface for this program, I've decided that I want stats about these rolls to be queryable, and one such stat I want the program to look up is the median of the roll. My current version of the algorithm accumulates the trial counts of the outcomes in ascending order, starting with the lowest outcome, and reports the first outcome at which the running total reaches at least 50% of the trials.

This is how I've written the code.

//Is filled with values during object construction
TreeMap<Integer, BigInteger> probabilityMap = new TreeMap<>();

//Memoization to at least make sure we only make this calculation once
private Integer memoizedMedian = null;
public int getMedian() {
    if(memoizedMedian == null) {
        BigInteger trials = BigInteger.ZERO;
        BigInteger totalTrials = numOfTrials();
        for(Map.Entry<Integer, BigInteger> entry : probabilityMap.entrySet()) {
            //We're guaranteed to be iterating in order, due to how TreeMaps work
            trials = trials.add(entry.getValue());
            double percentile = trials.doubleValue() / totalTrials.doubleValue();
            if(percentile >= 0.5) {
                memoizedMedian = entry.getKey();
                break;
            }
        }
        //If we didn't find it, something went wrong with the object initialization
        if(memoizedMedian == null)
            throw new RuntimeException("Probability map was not properly initialized");
    }
    return memoizedMedian;
}
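One incidental note on the loop above: `trials.doubleValue() / totalTrials.doubleValue()` can lose precision once the counts exceed what a double represents exactly (roughly 2^53). The same 50% check can be done in exact integer arithmetic by comparing 2·trials against totalTrials. A sketch of that variant, run against the 2d6 map from the top of the question:

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public class ExactMedianDemo {
    public static void main(String[] args) {
        //The 2d6 distribution from the question
        TreeMap<Integer, BigInteger> probabilityMap = new TreeMap<>();
        int[][] counts = {{2,1},{3,2},{4,3},{5,4},{6,5},{7,6},{8,5},{9,4},{10,3},{11,2},{12,1}};
        for (int[] c : counts) probabilityMap.put(c[0], BigInteger.valueOf(c[1]));

        BigInteger totalTrials = probabilityMap.values().stream()
                .reduce(BigInteger.ZERO, BigInteger::add);

        Integer median = null;
        BigInteger running = BigInteger.ZERO;
        for (Map.Entry<Integer, BigInteger> entry : probabilityMap.entrySet()) {
            running = running.add(entry.getValue());
            //2 * running >= totalTrials is equivalent to running/totalTrials >= 0.5,
            //but computed exactly, with no double rounding
            if (running.shiftLeft(1).compareTo(totalTrials) >= 0) {
                median = entry.getKey();
                break;
            }
        }
        System.out.println(median); // prints 7
    }
}
```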

The Problem

As written, this code works: it finds the median for any set of values I pass to it.

The problem, however, is that it can be slow. Consider this (exaggerated) dataset to see how that might become an issue:

{
[1,1],
[2,1],
[3,1],
[4,1],
[5,1],
[6,1],
[7,1],
[8,1],
[9,1],
[10,11]
}

Obviously, 10 is the median of this dataset, but the algorithm won't figure that out until it scans the very last entry in the map, and for larger, more complicated probability maps, that could take a while to reach.

So I'd like to improve the algorithm to handle these kinds of datasets more responsibly, but I'm not sure what kind of approach to take.

What kind of changes should I make to my algorithm to better calculate the median of this dataset? I'm open to changes to the root data structure as well, but that should have proper justification.

Xirema
  • Do you by any chance mean **average** and not **median**? – forpas Nov 07 '18 at 19:57
  • @forpas I definitely mean median. The average always requires me to iterate the entire dataset regardless, so there's not a lot I can do to optimize (except for memoizing the result, obviously). – Xirema Nov 07 '18 at 19:59
  • I would just flatten the dataset (or create a flat version of the data at the beginning) , sort it and look at the middle element. If the set is dynamic, adding an element to an already sorted list takes O(log n) and looking up the median takes constant time. If the original dataset is large, [this link](https://rcoh.me/posts/linear-time-median-finding/) describes how to find the median in linear time. The key -in my opinion- is to have a flat copy of the data. – jrook Nov 07 '18 at 19:59
  • @jrook You may want to review the parameters of this issue. For that to work, I'd have to create a flattened version of the data that could be larger than the maximum value of `BigInteger`, which is..... very big, to say the least. Flattening the dataset is not a sensible solution unless the original data structure gets changed in some dramatic manner, and you haven't specified how I would go about doing that. – Xirema Nov 07 '18 at 20:04
  • 1
    Larger than max value of BigInteger??? You mean you will have an array with `(2 ^ 32) ^ Integer.MAX_VALUE` [elements](https://stackoverflow.com/questions/12693273/is-there-an-upper-bound-to-biginteger)? I would imagine that even the most efficient algorithm will take a lot of time to run on such a large dataset. As explained in the link I provided, median of medians is usually pretty close to the actual median. So if you don't require an exact median, that might help speed up things. – jrook Nov 07 '18 at 20:08
  • @jrook To give some context, [I spit out the results of this program calculating `100d6`](https://gist.github.com/Xirema/a2c5f7588e8fb6af3066bbbd1f7f066e), to show how many trials can be associated with a single outcome. The mode, median, and mean of this dataset, 350, has 15237092858379903128111407924086725562812976591205826140530848189030092709496 trials associated with it... that would be quite difficult to flatten out! – Xirema Nov 07 '18 at 20:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/183260/discussion-between-xirema-and-jrook). – Xirema Nov 07 '18 at 21:29

1 Answer


I don't have experience with the sheer number of possibilities in your 100d6 example, so this may or may not be the best-optimized approach, but it front-loads the intensive work to the point when you create the probability map, by using a pair of buckets for small and large values. It is also order-dependent, though an order-independent version could be created with a two-way rebalancing method. I went ahead and used Integers just to be able to get away with basic math operations.

Initial entries will be very volatile and require a ton of rebalancing. The obvious downside of this is that your creation performance takes a hit, but your median performance becomes O(1).

The small bucket always contains the median, which can be found as max(smallBucket.keySet()). The large bucket contains everything above that key set and is kept only for rebalancing purposes. Note that this does not return the true median when it falls between two rolls: e.g. the median for 1d2 is 1.5, which cannot be returned if you represent the median as an Integer.

import java.util.TreeMap;

public class MedianMap {
    TreeMap<Integer, Integer> smallBucket = new TreeMap<>();
    TreeMap<Integer, Integer> largeBucket = new TreeMap<>();

    int smallBucketSize = 0;
    int largeBucketSize = 0;
    int median = 0;

    public void add(int value, int trials) {
        //invariant: smallBucket should hold at least as many trials as largeBucket
        largeBucket.put(value, trials);
        largeBucketSize += trials;

        if(largeBucketSize > smallBucketSize) {
            rebalance();
        }
    }

    private void rebalance() {
        while(largeBucketSize > smallBucketSize) {
            //get the smallest large-bucket entry to move over
            //(TreeMap keeps its keys sorted, so firstKey() is the minimum)
            Integer key = largeBucket.firstKey();
            Integer value = largeBucket.remove(key);

            //move the entry from the large to the small bucket
            smallBucket.put(key, value);

            //update bucket sizes
            largeBucketSize -= value;
            smallBucketSize += value;

            //and the largest key in the small bucket is the new median
            median = key;

            //repeat as necessary
        }
    }

    public int getMedian() {
        return median;
    }
}
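As a sanity check on the bucket idea (a compact sketch of my own, not a drop-in copy of the class above), here it is run against the skewed dataset from the question, adding values in ascending order since the approach is order-dependent:

```java
import java.util.TreeMap;

public class MedianMapDemo {
    static class MedianMap {
        final TreeMap<Integer, Integer> smallBucket = new TreeMap<>();
        final TreeMap<Integer, Integer> largeBucket = new TreeMap<>();
        int smallBucketSize = 0;
        int largeBucketSize = 0;
        int median = 0;

        void add(int value, int trials) {
            largeBucket.put(value, trials);
            largeBucketSize += trials;
            //rebalance until the small bucket holds at least half the trials
            while (largeBucketSize > smallBucketSize) {
                Integer key = largeBucket.firstKey(); //smallest key in the large bucket
                int v = largeBucket.remove(key);
                smallBucket.put(key, v);
                largeBucketSize -= v;
                smallBucketSize += v;
                median = key; //largest key now in the small bucket
            }
        }

        int getMedian() { return median; }
    }

    public static void main(String[] args) {
        MedianMap mm = new MedianMap();
        for (int v = 1; v <= 9; v++) mm.add(v, 1); //outcomes 1..9, 1 trial each
        mm.add(10, 11);                            //outcome 10, 11 trials
        System.out.println(mm.getMedian()); // prints 10
    }
}
```

With 20 trials total, the 10th and 11th smallest rolls are both 10, so the reported median of 10 matches the question's expectation, and the lookup itself is O(1).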
Compass