
I want to efficiently calculate two means over the values of a HashMap each time a new key/value pair is inserted.

Suppose we currently have this HashMap<Double, Double>:

3 4
5 6
8 8
1 3
6 8 <- Latest insertion

The latest insertion was the key 6 with value 8.

The first mean consists of all values whose keys are smaller than the inserted one, which is 6.

These are the values 4,6,3 of the keys 3,5,1, so the mean is (4+6+3)/3=4.3...

The second mean is the "opposite", so the mean of all values for all keys greater than 6.

The only key greater than 6 is 8, with value 8, so this mean is 8/1=8.

Now, a new key/value pair gets inserted:

3 4
5 6
6 8
8 8
1 3
4 9 <- Latest insertion

So again, we need to calculate the mean for all values with keys smaller than 4.

These are the values 4,3 for the keys 3,1, so the "smaller mean" is now (4+3)/2=3.5

The "greater mean" is now (6+8+8)/3=7.3... for the key/value pairs 5/6,6/8,8/8.

A naive implementation might be something like this:

import java.util.HashMap;
import java.util.Map;

public class CalculateMapMean {

    private double smallerMean = 0.0;
    private double greaterMean = 0.0;

    private HashMap<Double, Double> someMap = new HashMap<Double, Double>();

    public void calculateMeans(double latestInsertedKey) {
        double sumGreater = 0;
        double sumSmaller = 0;
        int greaterCount = 0;
        int smallerCount = 0;
        for (Map.Entry<Double, Double> entry : someMap.entrySet()) {
            double key = entry.getKey();
            double value = entry.getValue();
            if (key > latestInsertedKey) {
                sumGreater += value;
                ++greaterCount;
            } else if (key < latestInsertedKey) {
                sumSmaller += value;
                ++smallerCount;
            }
        }
        // Avoid dividing by zero when one side is empty.
        greaterMean = (greaterCount != 0) ? sumGreater / greaterCount : 0.0;
        smallerMean = (smallerCount != 0) ? sumSmaller / smallerCount : 0.0;
    }
}

The question is whether the calculation of the means can be dramatically improved with a TreeMap or another data structure, such that one does not have to iterate over all keys on every insertion.

Is there an elegant way of reusing former calculations?

rafalio
Juergen

3 Answers


The only way I can think of to get below O(n) time per change to the map is to keep a balanced binary search tree (BBST) over the keys. In every node you need to keep a few extra fields:

  • the number of nodes in the subtree rooted at that node
  • the sum of the values of all nodes in the subtree rooted at that node

Rebalancing a BBST after an insert/delete takes O(log n) time. In that same balance operation you can update the count and sum, also in O(log n) time (since you perform O(log n) operations that take O(1) time).
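For concreteness, a minimal sketch of such an augmented node might look like this (the class and helper names are illustrative, not from any library):

class Node {
    double key, value;
    Node left, right;
    int count;      // number of nodes in the subtree rooted here
    double sum;     // sum of all values in the subtree rooted here

    static int count(Node n)  { return n == null ? 0 : n.count; }
    static double sum(Node n) { return n == null ? 0.0 : n.sum; }

    // Recompute the cached fields from the children. Call this on every
    // node touched by an insert, delete or rotation, bottom-up; each call
    // is O(1), so rebalancing stays O(log n) overall.
    void update() {
        count = 1 + count(left) + count(right);
        sum = value + sum(left) + sum(right);
    }
}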

To get the correct means you need to traverse the tree and add the right counts. Let's give a simple example. Suppose I have the following 7 key-value pairs. I hope you can imagine how the corresponding BBST would look.

(3, 5) (4, 3) (7, 1) (8, 4) (11, 3) (12, 1) (13, 3)

In the root, (8, 4), the total count and sum are stored: [7, 20]. In the root of the left subtree, (4, 3), the total count and sum of that subtree are stored: [3, 9]. Drawing these extra values as a function of depth in the tree:

[         7, 20        ]
[   3, 9   ][   3, 7   ]
[1, 5][1, 1][1, 3][1, 3]

Suppose I now add a new tuple with key 10. I start traversing the tree at the root. Because 8 < 10, the root itself and everything in its left subtree have keys smaller than 10, so we can use the cached values [3, 9] together with the root's own pair (8, 4); there is no need to traverse the left subtree. For the right subtree we have to recurse, because some of its keys may be smaller than 10 and some larger. At its root, 12 > 10, so that node and its right subtree all have keys greater than 10: we take the cached [1, 3] plus the pair (12, 1) directly, and only recurse into the left child (11, 3).

In every layer of the tree, we can account for one branch through its cached values and recurse into the other. Therefore, finding the total value and count for keys smaller than the last inserted key, and likewise for keys greater than it, takes O(log n) time as well.
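Assuming the Node sketch above, the query for the "smaller" side could look like this; the "greater" side is symmetric:

// Returns {sum, count} of all values whose key is strictly smaller
// than x, in O(log n). Mirror the comparison for keys greater than x.
static double[] smallerThan(Node n, double x) {
    double sum = 0.0;
    int count = 0;
    while (n != null) {
        if (n.key < x) {
            // This node and its whole left subtree are smaller than x:
            // take the cached totals and continue into the right child.
            sum += n.value + Node.sum(n.left);
            count += 1 + Node.count(n.left);
            n = n.right;
        } else {
            // Everything smaller than x lies in the left subtree.
            n = n.left;
        }
    }
    return new double[] { sum, count };  // mean = sum / count when count > 0
}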

Vincent van der Weele
  • @downvoter: care to explain what's wrong? If there is a conceptual problem with this approach it's not gonna help anyone if it remains unsaid. – Vincent van der Weele May 16 '14 at 14:44
  • Thanks for this suggestion. I thought of something similar and was curious whether an efficient solution exists out of the box. – Juergen May 16 '14 at 16:19
  • @Juergen of course I cannot know for sure, but I would be highly surprised if an out-of-the-box data structure for this problem exists. You could of course still find a BBST implementation and modify that, because implementing that correctly is definitely not trivial! – Vincent van der Weele May 16 '14 at 18:16

Yes, a TreeSet would help.

Suppose an element e=(k,v) comes in. If you keep your tuples in a TreeSet ordered by key, you can use tailSet(e) to get all elements with keys greater than k, and headSet(e) for those with smaller keys. You can then compute the average of the values in those views in O(n) time, and insert the new tuple in O(log n).
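Since the question already stores key/value pairs in a map, here is a sketch of that idea using a TreeMap and its headMap/tailMap views instead of a TreeSet of tuples (the class name is made up):

import java.util.Collection;
import java.util.TreeMap;

// Sketch of the headMap/tailMap idea. Both views are iterated on every
// insertion, so each insertion still costs O(n) overall.
class TreeMapMeans {
    private final TreeMap<Double, Double> map = new TreeMap<Double, Double>();

    // Inserts (key, value) and returns {smallerMean, greaterMean}.
    double[] insertAndCompute(double key, double value) {
        map.put(key, value);
        double smallerMean = meanOf(map.headMap(key).values());        // keys < key
        double greaterMean = meanOf(map.tailMap(key, false).values()); // keys > key
        return new double[] { smallerMean, greaterMean };
    }

    private static double meanOf(Collection<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return values.isEmpty() ? 0.0 : sum / values.size();
    }
}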

I believe you could speed this up even more by using a balanced binary tree which, in addition to the key and the value, keeps track of the number of elements with lower keys and their average, and similarly for the elements in the right branch with higher keys. Then, when a new element comes in, you binary search for its insertion point and keep track of the averages you encounter, constructing the averages of the higher and lower numbers appropriately. I think the balancing would be tricky to implement, since everything moves around and you would have to maintain the integrity of the average labels.

That said, I recommend you just use a TreeSet.

rafalio
  • Isn't enumerating a `TreeSet` linear time? Maintaining the average labels (more likely, a sum label and a count label) isn't too hard if you implement the binary tree in the modular fashion proposed by [Austern et al.](http://www.stroustrup.com/tree-appendix.pdf). – David Eisenstat May 16 '14 at 15:03
  • I think I responded too quickly, you are correct, this does not help at all, disregard the first part of the answer since we will look at all elements of the treeset anyways. I think the best way is then to use a binary tree like I explained above. – rafalio May 16 '14 at 15:08

You can store these values inside your implementation, something like:

import java.util.HashMap;
import java.util.Map;

public class MyHashMap extends HashMap<Double, Double> {
    private double sum = 0;

    @Override
    public Double put(Double key, Double value) {
        // If the key is already present, its old value leaves the sum.
        if (containsKey(key)) {
            sum -= get(key);
        }
        sum += value;
        return super.put(key, value);
    }

    @Override
    public void putAll(Map<? extends Double, ? extends Double> map) {
        for (Map.Entry<? extends Double, ? extends Double> entry : map.entrySet()) {
            put(entry.getKey(), entry.getValue());
        }
    }

    @Override
    public Double remove(Object key) {
        Double value = get(key);
        if (value != null) {
            sum -= value;
        }
        return super.remove(key);
    }

    public double getMean() {
        return sum / size(); // NaN for an empty map
    }
}
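For illustration, hypothetical usage; note, as the comments below point out, that there is only a single running sum and therefore only a single mean:

MyHashMap map = new MyHashMap();
map.put(3.0, 4.0);
map.put(5.0, 6.0);
map.put(3.0, 9.0);                 // overwrites 4.0; sum becomes 15.0
System.out.println(map.getMean()); // (9.0 + 6.0) / 2 = 7.5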
Dmitry Ginzburg
  • This is incorrect. The OP wants *two* means: the mean for all keys smaller than the last inserted key and the mean for all keys larger than the last inserted key. There is no way to achieve this by keeping a single sum. – Vincent van der Weele May 16 '14 at 14:11
  • But this is the mean of all values except the inserted value, isn't it? I need two means, depending on the key. – Juergen May 16 '14 at 14:11