
I have a HashMap with 8 million Point2D keys, each mapping to a LinkedList of RoadEdges.

private Map<Point2D,LinkedList<RoadEdge>> adjacencyList;

Everything works as it should, but it takes a very long time to get data out of the HashMap. Is there any alternative I can use to optimize getting data out?

I am willing to compromise the time it takes to put() in favor of the time it takes to get().

Hedam
  • Checking for collisions is always a good starting point when your hashmap is slow. The difference between a poorly and well implemented `hashCode()` method can be enormous. – biziclop May 12 '17 at 10:18
  • probably not helpful, but I'd move to a database. :) – Steve Smith May 12 '17 at 10:20
  • This here *might* be helpful: http://www.javaspecialists.eu/archive/Issue235.html – GhostCat May 12 '17 at 10:20
  • Looking things up in a `HashMap` should be fast. Possible reasons: `Point2D` may not have a good `hashCode` method, so you get hash collisions, or it's not the map lookup that is slow but something else, like traversing a long `LinkedList`. – Jesper May 12 '17 at 10:21
  • You might consider using a data structure specifically designed for spatial lookups. For instance, if your "keys" are fixed, you might use a K-d tree, or similar. – Andy Turner May 12 '17 at 10:23
  • Database is not possible at the moment. All the points in the hashmap are very close (within decimals) - could that cause a lot of collisions? – Hedam May 12 '17 at 10:25
  • What is the actual lookup that you're doing here? Are you looking up exact points, or are you doing nearest-neighbour lookups? – Andy Turner May 12 '17 at 10:25
  • Actual lookups. I have points and I want to find the edges related to the points for an A* search. – Hedam May 12 '17 at 10:28
  • In A*, an LRU cache should help – Thijs Steel May 12 '17 at 10:29
  • I would not blame the `HashMap` in the first place. See [When to use LinkedList over ArrayList?](http://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist)… – Holger May 12 '17 at 10:30
  • I think it wouldn't matter changing to ArrayList, as we are iterating through all of them. – Hedam May 12 '17 at 10:34
  • @Hedam `ArrayList` is faster to iterate than a `LinkedList`. – Andy Turner May 12 '17 at 10:35
  • How about using some cache, like EhCache? http://www.ehcache.org/documentation/2.7/get-started/introduction.html – adi May 12 '17 at 10:41
  • Maybe a splay tree type structure would be useful too. – biziclop May 12 '17 at 11:52
  • Please show `Point2D.hashCode` implementation – fps May 12 '17 at 19:26
  • @FedericoPeraltaSchaffner Default implementation – Hedam May 13 '17 at 12:33

3 Answers


The first thing to check is the distribution of your hash codes, but with a minor change: the hash code of a key inside a HashMap is re-hashed internally via:

// java.util.HashMap's internal hash(): XORs the upper 16 bits of the key's
// hashCode into the lower 16 bits, so high-bit differences still affect the bucket
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

So you should really re-hash your hash codes with this function and then check how they are distributed.
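
As a sanity check, you can run your keys through that same function and count how many land in each bucket. Here is a minimal sketch of such a check (the helper names `spread` and `bucketHistogram` and the power-of-two capacity are illustrative assumptions, not part of any HashMap API):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashDistributionCheck {

    // mirrors HashMap's internal spreading function shown above
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // counts how many keys fall into each bucket of a table with the given
    // capacity (a power of two, as in HashMap)
    static <K> Map<Integer, Integer> bucketHistogram(List<K> keys, int capacity) {
        Map<Integer, Integer> histogram = new HashMap<>();
        for (K key : keys) {
            int bucket = spread(key) & (capacity - 1); // HashMap's index computation
            histogram.merge(bucket, 1, Integer::sum);
        }
        return histogram;
    }

    static <K> void report(List<K> keys, int capacity) {
        Map<Integer, Integer> histogram = bucketHistogram(keys, capacity);
        int worst = histogram.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        System.out.printf("%d keys -> %d occupied buckets of %d, worst bucket holds %d%n",
                keys.size(), histogram.size(), capacity, worst);
    }
}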

Generally, under a good distribution, your TreeNodes (this is how a bucket is set up when it holds many entries) are very fast to search. Even for a bucket holding Integer.MAX_VALUE entries it would take at most 32 steps to find one, because such a bin is transformed into a perfectly balanced red-black tree.

Searching inside a Map in general is O(1). Searching a bin of TreeNodes is O(log n). And searching inside a LinkedList is O(n), much worse than the previous ones.

But that is only the time it takes to find a single entry in the map. If you then need to find an element inside the LinkedList, that means additional time, worse than finding the entry in the map.

Also, for so many entries it is critical to specify the loadFactor and initialCapacity up front (at least the initialCapacity), before putting the elements into the map; otherwise the map will repeatedly re-hash and potentially move elements to other buckets as it grows. If you first put them all in and then only search, without altering the map, this is not a problem for lookups...
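
As a sketch, pre-sizing for the question's 8 million entries could look like this (the capacity formula just guarantees that size never exceeds capacity * loadFactor, so the table never grows while loading; Point2D, RoadEdge and the 8 million figure are taken from the question):

int expectedEntries = 8_000_000;
float loadFactor = 0.75f;
// big enough that expectedEntries <= initialCapacity * loadFactor
int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
Map<Point2D, LinkedList<RoadEdge>> adjacencyList = new HashMap<>(initialCapacity, loadFactor);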

But generally, unless you measure, and measure correctly, this might not be the problem you are facing at all; you might be looking in the wrong direction.

Eugene
  • First thing to check is where the actual bottleneck is, e.g. using a profiler. If I see `LinkedList` in the declaration, I’m not willing to accept the statement that the *hashmap* is the problem, without any proof… – Holger May 12 '17 at 10:32

First of all, check the hash function of your key (Point2D). It should be well distributed or you'll have a lot of collisions. See Eugene's answer for an explanation of the beauty that is the HashMap.
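
If the hash does turn out to be the culprit, a conventional prime-multiplier hash over both coordinates is usually well distributed. Here is a sketch with a hypothetical key class (PointKey is made up for illustration; you could wrap or replace your existing point type):

// hypothetical immutable key with a standard prime-multiplier hashCode
final class PointKey {
    final double x;
    final double y;

    PointKey(double x, double y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointKey)) return false;
        PointKey p = (PointKey) o;
        return Double.compare(x, p.x) == 0 && Double.compare(y, p.y) == 0;
    }

    @Override
    public int hashCode() {
        // 31 * h + next mixes both coordinates instead of cancelling equal ones out
        int h = Double.hashCode(x);
        return 31 * h + Double.hashCode(y);
    }
}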

Secondly, depending on how you access the data (that is, whether you fetch the same keys a lot), you can use a cache, but do this only if you can't fix the hashCode.

Here is my implementation of an LRU cache from another project. It might speed some things up (it's not wonderful code and does unchecked casting, but it should work in most cases).

package util;

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class CachedHashMap<K, V> extends HashMap<K, V> {

    private static final long serialVersionUID = 36215319740415714L;

    // small LRU caches in front of the backing map: one for recently fetched
    // values, one for recent containsKey() answers (including negative ones)
    private final LRUCache<K, V> cache;
    private final LRUCache<K, Boolean> containsCache;

    public CachedHashMap(int cacheSize) {
        containsCache = new LRUCache<>(2 * cacheSize);
        cache = new LRUCache<>(cacheSize);
    }

    @Override
    public V put(K key, V value) {
        cache.put(key, value);
        containsCache.put(key, true);
        return super.put(key, value);
    }

    @SuppressWarnings("unchecked")
    @Override
    public boolean containsKey(Object key) {
        if (containsCache.containsKey(key)) {
            return containsCache.get(key);
        }
        boolean contains = super.containsKey(key);
        containsCache.put((K) key, contains);
        return contains;
    }

    @Override
    public V remove(Object key) {
        containsCache.remove(key);
        cache.remove(key);
        return super.remove(key);
    }

    @SuppressWarnings("unchecked")
    @Override
    public V get(Object key) {
        // a cached "false" means the backing map is known not to contain the key
        if (containsCache.containsKey(key) && !containsCache.get(key)) {
            return null;
        }

        V value = cache.get(key);
        if (value != null) {
            return value;
        }

        value = super.get(key);

        K keyCast = (K) key;
        if (value != null) {
            containsCache.put(keyCast, true);
            cache.put(keyCast, value);
        } else {
            containsCache.put(keyCast, false);
        }
        return value;
    }

    // a LinkedHashMap in access order that evicts its eldest entry once it
    // grows past cacheSize
    static class LRUCache<A, B> extends LinkedHashMap<A, B> {

        private static final long serialVersionUID = -5958373482978477783L;

        private final int cacheSize;

        public LRUCache(int cacheSize) {
            super(16, 0.75f, true); // true = access order, needed for LRU eviction
            this.cacheSize = cacheSize;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<A, B> eldest) {
            return size() >= cacheSize;
        }
    }
}
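
A possible usage with the question's declaration (the cache size of 10_000 is an arbitrary illustration; tune it to your access pattern):

private Map<Point2D, LinkedList<RoadEdge>> adjacencyList = new CachedHashMap<>(10_000);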
Thijs Steel
  • And if you search for a lot of non-existent points, a Bloom filter can be useful too. – biziclop May 12 '17 at 10:22
  • I can tell you that the difference between all 8 million points is in the decimals. Will that affect the default hashCode? – Hedam May 12 '17 at 10:24
  • won't the xor mean that if x and y are equal, the hashcode will always be 0? – Thijs Steel May 12 '17 at 10:26
  • The coords are always 6.xxxx and -55.yyyy. If that produces the same hashCode I really see the issue. – Hedam May 12 '17 at 10:30
  • The cached HashMap made the average case go down to almost nothing. – Hedam May 12 '17 at 11:56
  • @Hedam of course it did, you have a cache now. But still, you are hiding the real problem, and that's OK if you are OK with that. – Eugene May 12 '17 at 12:09
  • @Hedam That's great. But Eugene really is right: if speed becomes a problem again (like it did in the project where I used this cache), cache misses will still be quite slow. As a hash function I suggest not including the exponent part of the double in your hash, since they are always the same. Then, instead of xor, use standard prime number approaches. – Thijs Steel May 12 '17 at 12:34
  • Why are you doing this double lookup in `if(containsCache.containsKey(key)){ return containsCache.get(key); }`? Your `containsCache` only contains the present keys or, in other words, always maps to `true`, so why not simply `if(containsCache.containsKey(key)){ return true; }`? Well, or why not using a `Set` in the first place? – Holger May 12 '17 at 14:01
  • @Hedam `difference between 8 million points are in the decimals`, but then later you give an example where the precision is just 4 digits after the dot. 4 digits cannot provide 8 million entries... – Eugene May 12 '17 at 14:49
  • The containsCache also contains false, for when the map doesn't have the given key – Thijs Steel May 12 '17 at 15:12
  • @Eugene, that was an example. I use all significant digits in the float Point2D. – Hedam May 13 '17 at 12:56

To prove a point, here is a JMH test that searches for an entry in maps containing 10, 10_000 and 10_000_000 entries. You can see from the results that the search is constant, O(1); the problem is elsewhere in your code. Even if you don't understand the code, the results are just readable numbers (see the end).

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import javafx.geometry.Point2D;

@BenchmarkMode(org.openjdk.jmh.annotations.Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
public class TwoMapsTest {
    public static void main(String[] args) throws RunnerException {

        Options opt = new OptionsBuilder().include(TwoMapsTest.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    // the assert map.size() == numberOfElements check below can absolutely fail,
    // since randomly generated points may collide - fingers crossed that it does not.

    @State(Scope.Thread)
    public static class Map_10 {

        int numberOfElements = 10;

        public Map<Point2D, Integer> map = new HashMap<>();

        public Point2D mightBePresent = null;

        public Point2D isPresent = null;

        @Setup(Level.Iteration)
        public void setUp() {
            int randomInsideHowManyBoundry = ThreadLocalRandom.current().nextInt(numberOfElements);
            for (int i = 0; i < numberOfElements; ++i) {

                double[] d = generateTwoPoints(-3.0, 3.9999, 4);
                Point2D p = new Point2D(d[0], d[1]);

                if (i == randomInsideHowManyBoundry) {
                    // re-picked every iteration, since tearDown clears the map
                    isPresent = new Point2D(d[0], d[1]);
                }
                map.put(p, i);
            }

            // a random point that may or may not be in the map
            double[] r = generateTwoPoints(-3.0, 3.9999, 4);
            mightBePresent = new Point2D(r[0], r[1]);

            assert map.containsKey(isPresent);
            assert map.size() == numberOfElements;
        }

        @TearDown(Level.Iteration)
        public void tearDown() {
            map.clear();
        }
    }

    @State(Scope.Thread)
    public static class Map_10_000 {

        int numberOfElements = 10_000;

        public Map<Point2D, Integer> map = new HashMap<>();

        public Point2D mightBePresent = null;

        public Point2D isPresent = null;

        @Setup(Level.Iteration)
        public void setUp() {
            int randomInsideHowManyBoundry = ThreadLocalRandom.current().nextInt(numberOfElements);
            for (int i = 0; i < numberOfElements; ++i) {

                double[] d = generateTwoPoints(-3.0, 3.9999, 4);
                Point2D p = new Point2D(d[0], d[1]);

                if (i == randomInsideHowManyBoundry) {
                    // re-picked every iteration, since tearDown clears the map
                    isPresent = new Point2D(d[0], d[1]);
                }
                map.put(p, i);
            }

            // a random point that may or may not be in the map
            double[] r = generateTwoPoints(-3.0, 3.9999, 4);
            mightBePresent = new Point2D(r[0], r[1]);

            assert map.containsKey(isPresent);
            assert map.size() == numberOfElements;
        }

        @TearDown(Level.Iteration)
        public void tearDown() {
            map.clear();
        }
    }

    @State(Scope.Thread)
    public static class Map_10_000_000 {

        int numberOfElements = 10_000_000;

        public Map<Point2D, Integer> map = new HashMap<>();

        public Point2D mightBePresent = null;

        public Point2D isPresent = null;

        @Setup(Level.Iteration)
        public void setUp() {
            int randomInsideHowManyBoundry = ThreadLocalRandom.current().nextInt(numberOfElements);
            for (int i = 0; i < numberOfElements; ++i) {

                double[] d = generateTwoPoints(-3.0, 3.9999, 4);
                Point2D p = new Point2D(d[0], d[1]);

                if (i == randomInsideHowManyBoundry) {
                    // re-picked every iteration, since tearDown clears the map
                    isPresent = new Point2D(d[0], d[1]);
                }
                map.put(p, i);
            }

            // a random point that may or may not be in the map
            double[] r = generateTwoPoints(-3.0, 3.9999, 4);
            mightBePresent = new Point2D(r[0], r[1]);

            assert map.containsKey(isPresent);
            assert map.size() == numberOfElements;
        }

        @TearDown(Level.Iteration)
        public void tearDown() {
            map.clear();
        }
    }

    @Fork(1)
    @Benchmark
    public int mightBePresentMap_10(Map_10 map) {
        Integer x = map.map.get(map.mightBePresent);
        return x == null ? -1 : x;
    }

    @Fork(1)
    @Benchmark
    public int isPresentMap_10(Map_10 map) {
        Integer x = map.map.get(map.isPresent);
        return x == null ? -1 : x;
    }

    @Fork(1)
    @Benchmark
    public int mightBePresentMap_10_000(Map_10_000 map) {
        Integer x = map.map.get(map.mightBePresent);
        return x == null ? -1 : x;
    }

    @Fork(1)
    @Benchmark
    public int isPresentMap_10_000(Map_10_000 map) {
        Integer x = map.map.get(map.isPresent);
        return x == null ? -1 : x;
    }

    @Fork(1)
    @Benchmark
    public int mightBePresentMap_10_000_000(Map_10_000_000 map) {
        Integer x = map.map.get(map.mightBePresent);
        return x == null ? -1 : x;
    }

    @Fork(1)
    @Benchmark
    public int isPresentMap_10_000_000(Map_10_000_000 map) {
        Integer x = map.map.get(map.isPresent);
        return x == null ? -1 : x;
    }

    // parameters were swapped in the original, which made nextDouble(origin, bound)
    // throw; y was also rounded from x instead of its own random value
    private static double[] generateTwoPoints(double lowerBound, double upperBound, int precision) {
        double x = ThreadLocalRandom.current().nextDouble(lowerBound, upperBound);
        x = BigDecimal.valueOf(x).setScale(precision, RoundingMode.HALF_UP).doubleValue();

        double y = ThreadLocalRandom.current().nextDouble(lowerBound, upperBound);
        y = BigDecimal.valueOf(y).setScale(precision, RoundingMode.HALF_UP).doubleValue();

        return new double[] { x, y };
    }
}

And the actual results:

Benchmark                                          (howManyEntries)  Mode  Cnt   Score   Error  Units
hashmap8Millions.EightMillionsTest.isPresent                      1  avgt    5   8.787 ± 0.259  ns/op
hashmap8Millions.EightMillionsTest.isPresent                     10  avgt    5   9.988 ± 0.283  ns/op
hashmap8Millions.EightMillionsTest.isPresent                    256  avgt    5   9.146 ± 2.081  ns/op
hashmap8Millions.EightMillionsTest.isPresent                  10000  avgt    5   8.871 ± 0.574  ns/op
hashmap8Millions.EightMillionsTest.isPresent                1000000  avgt    5   8.894 ± 0.676  ns/op
hashmap8Millions.EightMillionsTest.isPresent               10000000  avgt    5  10.884 ± 5.623  ns/op
hashmap8Millions.EightMillionsTest.mightBePresent                 1  avgt    5   4.607 ± 0.175  ns/op
hashmap8Millions.EightMillionsTest.mightBePresent                10  avgt    5   4.713 ± 0.944  ns/op
hashmap8Millions.EightMillionsTest.mightBePresent               256  avgt    5   5.283 ± 0.511  ns/op
hashmap8Millions.EightMillionsTest.mightBePresent             10000  avgt    5   8.944 ± 0.144  ns/op
hashmap8Millions.EightMillionsTest.mightBePresent           1000000  avgt    5  10.256 ± 0.121  ns/op
hashmap8Millions.EightMillionsTest.mightBePresent          10000000  avgt    5   8.764 ± 0.504  ns/op

Notice that these are nanoseconds; this is far from slow...

Eugene