Summary
We have recently changed our String-based ID schema in a complex retrieval engine and observed a severe performance drop. In essence, we changed the IDs from XXX-00000001
to X384840564
(see below for details on the ID schema) and suffer from almost doubled runtimes. Choosing a different string hash function solved the problem, but we still lack a good explanation. Thus, our questions are:
- Why do we see such a strong performance drop when changing from the old to the new ID schema?
- Why does our solution of using the “parent hash” actually work?
To approach the problem, we hereafter provide (a) detailed information about the ID schemata and hash functions used, (b) a minimal working example in Java that highlights the performance defect, and (c) our performance results and observations.
(Despite the lengthy description, we have already massively reduced the code example to 4 performance critical lines – see phase 2 in the listing.)
(a) Old and new ID schema; hash functions
Our ID objects consist of a parent ID object (string of 16 characters in [A-Z0-9]) and a child ID string. The same parent ID string is on average used by 1–10 child IDs. The old child IDs had a three-letter prefix, a dash, and a zero-padded running index number of length 8, for example, XXX-00000001
(12 characters in total; X may be any letter [A-Z]). The new child IDs have one letter and 9 non-consecutive digits, for example, X384840564
(10 characters in total, X may be any letter [A-Z]). An obvious difference is that the old child ID strings are often recurring (i.e., the string ABC-00000002
occurs with multiple different parent IDs, as the running index typically starts with 1), while the new child IDs with their arbitrary digit combinations typically occur only a few times or even only with a single parent ID.
Since the ID objects are put into HashSets and HashMaps, the choice of a hash function seems crucial. Currently, the system uses the standard string hash for the parent IDs. For the child IDs, we used to XOR the string hashes of parent and child ID – called XOR hash henceforth. In theory, this should distribute different child IDs quite well. As a variant, we experimented with using only the string hash of the parent ID as the hash code of the child ID – called parent hash henceforth. That is, all child IDs sharing the same parent ID share the same hash. In theory, the parent hash could be suboptimal, as all children sharing the same parent ID end up in the same bucket, while the XOR hash should yield a better data distribution.
(b) Minimal working example
Please refer to the following listing (explanation below):
package mwe;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
public class Main {
private static final Random RANDOM = new Random(42);
private static final String DIGITS = "0123456789";
private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + DIGITS;
private static final int NUM_IDS = 5_000_000;
private static final int MAX_CHILDREN = 5;
private static final int REPETITIONS = 5;
private static final boolean RUN_ID_OLD = true; // e.g., 8IBKMAO2T1ORICNZ__XXX-00000002
private static final boolean RUN_ID_NEW = false; // e.g., 6TEG9R5JP1KHJN55__X580104176
private static final boolean USE_PARENT_HASH = false;
private static final boolean SHUFFLE_SET = false;
private abstract static class BaseID {
protected int hash;
public abstract BaseID getParentID();
@Override
public int hashCode() {
return this.hash;
}
}
private static class ParentID extends BaseID {
private final String id;
public ParentID(final String id) {
this.id = id;
this.hash = id.hashCode();
}
@Override
public BaseID getParentID() {
return null;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (obj instanceof ParentID) {
final ParentID o = (ParentID) obj;
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.id;
}
}
private static class ChildID extends BaseID {
private final String id;
private final BaseID parent;
public ChildID(final String id, final BaseID parent) {
this.id = id;
this.parent = parent;
// Initialize the hash code of the child ID:
if (USE_PARENT_HASH) {
// Only use the parent hash (i.e., all children have the same hash).
this.hash = parent.hashCode();
} else {
// XOR parent and child hash.
this.hash = parent.hashCode() ^ id.hashCode();
}
}
@Override
public BaseID getParentID() {
return this.parent;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (this.hash != obj.hashCode()) {
return false;
}
if (obj instanceof ChildID) {
final ChildID o = (ChildID) obj;
final BaseID oParent = o.getParentID();
if (this.parent == null && oParent != null) {
return false;
}
if (this.parent != null && oParent == null) {
return false;
}
if (this.parent == null || !this.parent.equals(oParent)) {
return false;
}
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.parent.toString() + "__" + this.id;
}
}
public static void run(final int repetitions, final boolean useVariant2IDs) throws IOException {
for (int i = 0; i < repetitions; i++) {
System.gc(); // Force memory reset for the next repetition.
// -- PHASE 1: CREATE DATA --------------------------------------------------------------------------------
// Fill a set of several millions random IDs. Each ID is a child ID with a reference to its parent ID.
// Each parent ID has between 1 and MAX_CHILDREN children.
Set<BaseID> ids = new HashSet<>(NUM_IDS);
for (int parentIDIdx = 0; parentIDIdx < NUM_IDS; parentIDIdx++) {
// Generate parent ID: 16 random characters.
final StringBuilder parentID = new StringBuilder();
for (int k = 0; k < 16; k++) {
parentID.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
}
// Generate between 1 and MAX_CHILDREN child IDs.
final int childIDCount = RANDOM.nextInt(MAX_CHILDREN) + 1;
for (int childIDIdx = 0; childIDIdx < childIDCount; childIDIdx++) {
final StringBuilder childID = new StringBuilder();
if (useVariant2IDs) {
// Variant 2: Child ID = letter X plus 9 random digits.
childID.append("X");
for (int k = 0; k < 9; k++) {
childID.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
}
} else {
// Variant 1: Child ID = XXX- plus zero-padded index of length 8.
childID.append("XXX-").append(String.format("%08d", childIDIdx + 1));
}
final BaseID id = new ChildID(childID.toString(), new ParentID(parentID.toString()));
ids.add(id);
}
}
System.out.print(ids.iterator().next().toString());
System.out.flush();
if (SHUFFLE_SET) {
final List<BaseID> list = new ArrayList<>(ids);
Collections.shuffle(list);
ids = new LinkedHashSet<>(list);
}
System.gc(); // Clean up temp data before starting the timer.
// -- PHASE 2: INDEX DATA ---------------------------------------------------------------------------------
// Iterate over the ID set and fill a map indexed by parent IDs. The map values are irrelevant here, so
// use empty objects.
final long timer = System.currentTimeMillis();
final HashMap<BaseID, Object> map = new HashMap<>();
for (final BaseID id : ids) {
map.put(id.getParentID(), new Object());
}
System.out.println("\t" + (System.currentTimeMillis() - timer));
// Ensure that map and IDs are not GC:ed before the timer stops.
if (map.get(new ParentID("_do_not_gc")) == null) {
map.put(new ParentID("_do_not_gc"), new Object());
}
ids.add(new ParentID("_do_not_gc"));
}
}
public static void main(final String[] args) throws IOException {
if (RUN_ID_OLD) {
run(REPETITIONS, false);
}
if (RUN_ID_NEW) {
run(REPETITIONS, true);
}
}
}
In essence, the program first generates a HashSet of IDs and then indexes these IDs by their parent ID in a HashMap. In detail:
The first phase (PHASE 1) generates 5 million parent IDs, each with 1 to 10 child IDs using either the old (e.g., XXX-00000001
) or the new ID schema (e.g., X384840564
) and one of the two hash functions. The generated child IDs are collected in a HashSet. We explicitly create new parent ID objects for each child ID to match the functionality of the original system. For experimentation, the IDs can optionally be shuffled in a LinkedHashSet to distort the hash-based ordering (cf. boolean SHUFFLE_SET
).
The second phase (PHASE 2) simulates the performance-critical path. It reads all IDs (child IDs with their parents) from the HashSet and puts them into a HashMap with the parent IDs as keys (i.e., aggregate IDs by parent).
Note: The actual retrieval system has a more complex logic, such as reading IDs from multiple sets and merging child IDs as the map entry’s values, but it turned out that none of these steps was responsible for the strong performance gap in question.
The remaining lines try to control for the GC, such that the data structures are not GC:ed too early. We’ve tried different alternatives for controlling the GC, but the results seemed pretty stable overall.
When running the program, the constants RUN_ID_OLD
and RUN_ID_NEW
activate the old and the new ID schema, respectively (best activate only one at a time). USE_PARENT_HASH
switches between the XOR hash (false) and the parent hash (true). SHUFFLE_SET
distorts the item order in the ID set. All other constants can remain as they are.
(c) Results
All results here are based on a typical Windows desktop with OpenJDK 11. We also tested Oracle JDK 8 and a Linux machine, but observed similar effects in all cases. For the following figure, we tested each configuration in independent runs, whereas each run repeats the timing 5 times. To avoid outliers, we report the median of the repetitions. Note, however, that the timings of the repetitions do not differ much. The performance is measured in milliseconds.
Observations:
- Using XOR hash yields a substantial performance drop in the HashSet setting when switching to the new ID schema. This hash function seems suboptimal, but we lack a good explanation.
- Using the parent hash function speeds up the process regardless of the ID schema. We speculate that the internal HashSet order is beneficial, since the resulting HashMap will build up the same order (because ID.hash = ID.parent.hash). Interestingly, this effect can also be observed if the HashSet is split into, say, 50 parts, each holding a random partition of the full HashSet. This leaves us puzzled.
- The entire process seems to be heavily dependent of the reading order in the for loop of the second phase (i.e., the internal order of the HashSet). If we distort the order in the shuffled LinkHashSet, we end up in a worst-case scenario, regardless of the ID schema.
- In a separate experiment, we also diagnosed the number of collisions when filling the HashMap, but could not find obvious differences when changing the ID schema.
Who can shed more light on explaining these results?
Update
The image below shows some profiling results (using VisualVM) for the non-shuffled runs. Indent indicates nested calls. All percentage values are relative to the phase 2 timing (100%).
An obious difference seems to be HashMap.putVal's self time. There was no obvious difference for treeifying buckets.