
Summary

We have recently changed our String-based ID schema in a complex retrieval engine and observed a severe performance drop. In essence, we changed the IDs from XXX-00000001 to X384840564 (see below for details on the ID schema) and now suffer from almost doubled runtimes. Choosing a different string hash function solved the problem, but we still lack a good explanation. Thus, our questions are:

  1. Why do we see such a strong performance drop when changing from the old to the new ID schema?
  2. Why does our solution of using the “parent hash” actually work?

To approach the problem, we hereafter provide (a) detailed information about the ID schemata and hash functions used, (b) a minimal working example in Java that highlights the performance defect, and (c) our performance results and observations.

(Despite the lengthy description, we have already massively reduced the code example to 4 performance-critical lines – see phase 2 in the listing.)

(a) Old and new ID schema; hash functions

Our ID objects consist of a parent ID object (string of 16 characters in [A-Z0-9]) and a child ID string. The same parent ID string is on average used by 1–10 child IDs. The old child IDs had a three-letter prefix, a dash, and a zero-padded running index number of length 8, for example, XXX-00000001 (12 characters in total; X may be any letter [A-Z]). The new child IDs have one letter and 9 non-consecutive digits, for example, X384840564 (10 characters in total, X may be any letter [A-Z]). An obvious difference is that the old child ID strings are often recurring (i.e., the string ABC-00000002 occurs with multiple different parent IDs, as the running index typically starts with 1), while the new child IDs with their arbitrary digit combinations typically occur only a few times or even only with a single parent ID.
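To make the recurrence difference concrete, here is a minimal sketch of the two child-ID shapes (the generator methods are our own illustration, not production code):

```java
import java.util.Random;

public class IdSchemaDemo {

   private static final Random RANDOM = new Random(42);
   private static final String DIGITS = "0123456789";

   // Old schema: three-letter prefix, dash, zero-padded running index of length 8.
   // The running index restarts per parent, so the same child string recurs
   // under many different parents (and so do its string hashes).
   static String oldChildID(final int runningIndex) {
      return String.format("XXX-%08d", runningIndex);
   }

   // New schema: one letter plus 9 random digits. Such strings rarely recur.
   static String newChildID() {
      final StringBuilder sb = new StringBuilder("X");
      for (int k = 0; k < 9; k++) {
         sb.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
      }
      return sb.toString();
   }

   public static void main(final String[] args) {
      // Two different parents both start their index at 1, so the old child
      // strings are identical across parents.
      System.out.println(oldChildID(1)); // XXX-00000001
      System.out.println(newChildID());
   }
}
```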

Since the ID objects are put into HashSets and HashMaps, the choice of a hash function seems crucial. Currently, the system uses the standard string hash for the parent IDs. For the child IDs, we XOR the string hashes of the parent and child ID – called the XOR hash henceforth. In theory, this should distribute different child IDs quite well. As a variant, we experimented with using only the string hash of the parent ID as the hash code of the child ID – called the parent hash henceforth. That is, all child IDs sharing the same parent ID share the same hash. In theory, the parent hash could be suboptimal, as all children sharing the same parent ID end up in the same bucket, while the XOR hash should yield a better data distribution.
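Stripped of the class hierarchy in the listing below, the two variants boil down to the following sketch (the method names are ours):

```java
public class HashVariants {

   // XOR hash: combine the parent and child string hashes.
   static int xorHash(final String parentID, final String childID) {
      return parentID.hashCode() ^ childID.hashCode();
   }

   // Parent hash: all children of one parent share the same hash code
   // (and therefore end up in the same bucket).
   static int parentHash(final String parentID) {
      return parentID.hashCode();
   }

   public static void main(final String[] args) {
      final String parent = "8IBKMAO2T1ORICNZ";
      final String child = "X580104176";
      // XOR is self-inverse: XORing the child hash back in recovers the parent hash.
      System.out.println((xorHash(parent, child) ^ child.hashCode()) == parentHash(parent)); // true
   }
}
```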

(b) Minimal working example

Please refer to the following listing (explanation below):

package mwe;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class Main {

   private static final Random RANDOM = new Random(42);
   private static final String DIGITS = "0123456789";
   private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + DIGITS;
   private static final int NUM_IDS = 5_000_000;
   private static final int MAX_CHILDREN = 5;
   private static final int REPETITIONS = 5;
   private static final boolean RUN_ID_OLD = true; // e.g., 8IBKMAO2T1ORICNZ__XXX-00000002
   private static final boolean RUN_ID_NEW = false; // e.g., 6TEG9R5JP1KHJN55__X580104176
   private static final boolean USE_PARENT_HASH = false;
   private static final boolean SHUFFLE_SET = false;

   private abstract static class BaseID {

      protected int hash;

      public abstract BaseID getParentID();

      @Override
      public int hashCode() {
         return this.hash;
      }

   }

   private static class ParentID extends BaseID {

      private final String id;

      public ParentID(final String id) {
         this.id = id;
         this.hash = id.hashCode();
      }

      @Override
      public BaseID getParentID() {
         return null;
      }

      @Override
      public boolean equals(final Object obj) {
         if (this == obj) {
            return true;
         }

         if (obj instanceof ParentID) {
            final ParentID o = (ParentID) obj;
            return this.id.equals(o.id);
         }
         return false;
      }

      @Override
      public String toString() {
         return this.id;
      }

   }

   private static class ChildID extends BaseID {

      private final String id;
      private final BaseID parent;

      public ChildID(final String id, final BaseID parent) {
         this.id = id;
         this.parent = parent;
         // Initialize the hash code of the child ID:
         if (USE_PARENT_HASH) {
            // Only use the parent hash (i.e., all children have the same hash).
            this.hash = parent.hashCode();
         } else {
            // XOR parent and child hash.
            this.hash = parent.hashCode() ^ id.hashCode();
         }
      }

      @Override
      public BaseID getParentID() {
         return this.parent;
      }

      @Override
      public boolean equals(final Object obj) {
         if (this == obj) {
            return true;
         }
         if (this.hash != obj.hashCode()) {
            return false;
         }

         if (obj instanceof ChildID) {
            final ChildID o = (ChildID) obj;
            final BaseID oParent = o.getParentID();
            // Null-safe parent comparison: parents are equal if both are null
            // or if they compare equal.
            if (this.parent == null ? oParent != null : !this.parent.equals(oParent)) {
               return false;
            }
            return this.id.equals(o.id);
         }
         return false;
      }

      @Override
      public String toString() {
         return this.parent.toString() + "__" + this.id;
      }

   }

   public static void run(final int repetitions, final boolean useVariant2IDs) throws IOException {
      for (int i = 0; i < repetitions; i++) {
         System.gc(); // Request a GC (best effort) to reset memory for the next repetition.

         // -- PHASE 1: CREATE DATA --------------------------------------------------------------------------------
         // Fill a set of several millions random IDs. Each ID is a child ID with a reference to its parent ID.
         // Each parent ID has between 1 and MAX_CHILDREN children.
         Set<BaseID> ids = new HashSet<>(NUM_IDS);
         for (int parentIDIdx = 0; parentIDIdx < NUM_IDS; parentIDIdx++) {
            // Generate parent ID: 16 random characters.
            final StringBuilder parentID = new StringBuilder();
            for (int k = 0; k < 16; k++) {
               parentID.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
            }
            // Generate between 1 and MAX_CHILDREN child IDs.
            final int childIDCount = RANDOM.nextInt(MAX_CHILDREN) + 1;
            for (int childIDIdx = 0; childIDIdx < childIDCount; childIDIdx++) {
               final StringBuilder childID = new StringBuilder();
               if (useVariant2IDs) {
                  // Variant 2: Child ID = letter X plus 9 random digits.
                  childID.append("X");
                  for (int k = 0; k < 9; k++) {
                     childID.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
                  }
               } else {
                  // Variant 1: Child ID = XXX- plus zero-padded index of length 8.
                  childID.append("XXX-").append(String.format("%08d", childIDIdx + 1));
               }
               final BaseID id = new ChildID(childID.toString(), new ParentID(parentID.toString()));
               ids.add(id);
            }
         }
         System.out.print(ids.iterator().next().toString());
         System.out.flush();
         if (SHUFFLE_SET) {
            final List<BaseID> list = new ArrayList<>(ids);
            Collections.shuffle(list);
            ids = new LinkedHashSet<>(list);
         }
         System.gc(); // Clean up temp data before starting the timer.

         // -- PHASE 2: INDEX DATA ---------------------------------------------------------------------------------
         // Iterate over the ID set and fill a map indexed by parent IDs. The map values are irrelevant here, so
         // use empty objects.
         final long timer = System.currentTimeMillis();
         final HashMap<BaseID, Object> map = new HashMap<>();
         for (final BaseID id : ids) {
            map.put(id.getParentID(), new Object());
         }
         System.out.println("\t" + (System.currentTimeMillis() - timer));

         // Keep map and ids reachable so they cannot be garbage-collected (or optimized away) prematurely.
         if (map.get(new ParentID("_do_not_gc")) == null) {
            map.put(new ParentID("_do_not_gc"), new Object());
         }
         ids.add(new ParentID("_do_not_gc"));
      }
   }

   public static void main(final String[] args) throws IOException {
      if (RUN_ID_OLD) {
         run(REPETITIONS, false);
      }
      if (RUN_ID_NEW) {
         run(REPETITIONS, true);
      }
   }

}

In essence, the program first generates a HashSet of IDs and then indexes these IDs by their parent ID in a HashMap. In detail:

The first phase (PHASE 1) generates 5 million parent IDs, each with 1 to MAX_CHILDREN (here 5) child IDs, using either the old (e.g., XXX-00000001) or the new ID schema (e.g., X384840564) and one of the two hash functions. The generated child IDs are collected in a HashSet. We explicitly create a new parent ID object for each child ID to match the behavior of the original system. For experimentation, the IDs can optionally be shuffled into a LinkedHashSet to distort the hash-based ordering (cf. the boolean SHUFFLE_SET).

The second phase (PHASE 2) simulates the performance-critical path. It reads all IDs (child IDs with their parents) from the HashSet and puts them into a HashMap with the parent IDs as keys (i.e., aggregate IDs by parent).

Note: The actual retrieval system has a more complex logic, such as reading IDs from multiple sets and merging child IDs as the map entry’s values, but it turned out that none of these steps was responsible for the strong performance gap in question.
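For reference when reasoning about phase 2: in OpenJDK's HashMap (and thus in the HashMap backing a HashSet), the bucket index is derived from the key's hash code by folding the upper 16 bits into the lower ones and masking with the table size. The following sketch mirrors that computation (it is our reproduction of the JDK's logic, not part of our code base):

```java
public class BucketIndexDemo {

   // Mirrors OpenJDK HashMap.hash(): fold the upper 16 bits into the lower 16.
   static int spread(final int h) {
      return h ^ (h >>> 16);
   }

   // Bucket index for a table of the given power-of-two capacity.
   static int bucketIndex(final int hashCode, final int capacity) {
      return (capacity - 1) & spread(hashCode);
   }

   public static void main(final String[] args) {
      // A HashSet iterates in table order, so the for loop of phase 2 visits
      // the parent IDs in an order determined by the *child* hash codes.
      System.out.println(bucketIndex("8IBKMAO2T1ORICNZ".hashCode(), 1 << 16));
   }
}
```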

The remaining lines try to control for the GC so that the data structures are not garbage-collected too early. We have tried several alternatives for controlling the GC, but the results seemed pretty stable overall.

When running the program, the constants RUN_ID_OLD and RUN_ID_NEW activate the old and the new ID schema, respectively (best activate only one at a time). USE_PARENT_HASH switches between the XOR hash (false) and the parent hash (true). SHUFFLE_SET distorts the item order in the ID set. All other constants can remain as they are.

(c) Results

All results here are based on a typical Windows desktop with OpenJDK 11. We also tested Oracle JDK 8 and a Linux machine and observed similar effects in all cases. For the following figure, we tested each configuration in independent runs, where each run repeats the timing 5 times. To avoid outliers, we report the median of the repetitions; note, however, that the timings of the repetitions do not differ much. Performance is measured in milliseconds.

Performance results (median of 5 runs; in milliseconds)

Observations:

  1. Using XOR hash yields a substantial performance drop in the HashSet setting when switching to the new ID schema. This hash function seems suboptimal, but we lack a good explanation.
  2. Using the parent hash function speeds up the process regardless of the ID schema. We speculate that the internal HashSet order is beneficial, since the resulting HashMap will build up the same order (because ID.hash = ID.parent.hash). Interestingly, this effect can also be observed if the HashSet is split into, say, 50 parts, each holding a random partition of the full HashSet. This leaves us puzzled.
  3. The entire process seems to be heavily dependent on the reading order in the for loop of the second phase (i.e., on the internal order of the HashSet). If we distort the order via the shuffled LinkedHashSet, we end up in a worst-case scenario, regardless of the ID schema.
  4. In a separate experiment, we also diagnosed the number of collisions when filling the HashMap, but could not find obvious differences when changing the ID schema.
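The collision diagnostic mentioned in observation 4 can be approximated with a sketch like the following, which simulates HashMap's bucket placement for a list of hash codes (our own approximation; it does not instrument the real map):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollisionDiagnostic {

   // Simulate HashMap's bucket placement for a list of hash codes and
   // report the size of the fullest bucket.
   static int maxBucketSize(final List<Integer> hashes, final int capacity) {
      final Map<Integer, Integer> buckets = new HashMap<>();
      for (final int h : hashes) {
         final int idx = (capacity - 1) & (h ^ (h >>> 16)); // HashMap's spread + mask
         buckets.merge(idx, 1, Integer::sum);
      }
      return buckets.values().stream().max(Integer::compare).orElse(0);
   }

   public static void main(final String[] args) {
      // Hash codes 1 and 17 land in the same bucket of a 16-slot table.
      System.out.println(maxBucketSize(List.of(1, 17, 2), 16)); // 2
   }
}
```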

Who can shed more light on explaining these results?

Update

The image below shows some profiling results (using VisualVM) for the non-shuffled runs. Indent indicates nested calls. All percentage values are relative to the phase 2 timing (100%).

Profiler results

An obvious difference seems to be HashMap.putVal's self time. There was no obvious difference in the treeification of buckets.

sfcme
  • Did you profile the two versions to see where most of that time is spent? You seem to have done most other relevant things; that seems to be the logical next step. – Joachim Sauer Nov 17 '20 at 15:59
  • Side-note: You might want to check out [How do I write a correct micro-benchmark](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java), as you commit several common benchmarking sins in your code. I *think* the actual performance difference between the variants is big enough that the errors don't invalidate your results, but you might want to take a look at that anyway. – Joachim Sauer Nov 17 '20 at 16:01
  • 1
    In the sorted case with many colliding hashes (i.e. the parent-hash variant) you insert many hash-colliding object one after each other, which makes it very likely (almost guaranteed) that the affected bins in the underlying `HashMap` implementing the `HashSet` will be treeified. This is likely to reduce the negative performance impact you get from that many collisions. In the randomized case that treeifying will happen much later (because you randomize when you enter fields and you will suffer the worse performance of the non-treeified case for longer. That's just one possible reason ... – Joachim Sauer Nov 17 '20 at 16:14
  • 1
    Upvote for seeding the RNG with 42. – Steve H. Nov 17 '20 at 16:46
  • @JoachimSauer: Many thanks for your response. I've added some profiling results to the post. For benchmarking, I've used JMH before, but set it aside here, as the difference is quite prominent and we've confirmed it in the full code base (in most other cases, I totally agree that a more thorough benchmarking is necessary to tell the differences). The treeifying aspects may be a good hint towards the solution. The profiling, however, did not directly show evidence for this, but I will have to look closer into it. – sfcme Nov 18 '20 at 09:15
  • It looks like the hash code of child id objects are never used in phase 2, unless I'm missing something. You only ever use `null` or a `ParentID` as the key to the hashmap. Which would make it weird that there is an actual performance difference between the hash variants which only apply to child id objects, as far as I can see. – Joachim Sauer Nov 18 '20 at 09:42
  • 1
    @JoachimSauer: Exactly. The only difference IMHO is the order of the HashSet used in the for loop, which is likely based on the internal (i.e., child-hash-based) order of the set elements. Why this should yield a higher runtime for phase 2 remains unclear. Especially when considering observation 2 (same effect when partitioning the HashSet in 50 parts). – sfcme Nov 18 '20 at 10:15
  • 1
    If reading from the set is the part that changes performance then I'd try removing the "add to hashmap" part and replace it with a void (in JMH speak). I'd also check if *maybe* data locality plays a role here, but that's rather hard to investigate (or put differently: I don't know how I would investigate this, other might). – Joachim Sauer Nov 18 '20 at 10:27
