What is the performance impact of interning all strings in Java?

Question

I am working on a trade processing application where I have to deal with a lot of strings. Some of those strings are non-repeating such as a Trade ID whereas others repeat frequently such as Product ID.

I am considering interning all trade attributes as a generic step while parsing the Trade message (JSON) to reduce the memory usage and speed up equality checks.

My question is whether I might unintentionally degrade performance with this move?

Are you at that point of tuning your application that this is a valid issue, or is this a premature micro-optimization that you're thinking of doing without a good reason? Measure, measure, measure... — Kayaman, Nov 12 '17 at 05:54
I don't think that interning can speed up the parsing as every string must be extracted from the input JSON first and then interned. So, I guess, that parsing will be slower, but it may help later. There is [a JSON parser](https://github.com/square/moshi/blob/master/moshi/src/main/java/com/squareup/moshi/ClassJsonAdapter.java#L158) which works with bytes and matches the known attribute names without producing the strings. — maaartinus, Nov 12 '17 at 18:37

apangin · Accepted Answer · 2017-11-12T19:03:53.657

Deduplicating common strings is usually a good idea to save memory.
But never use String.intern for deduplication!

String.intern is a native method; each call suffers from additional JNI overhead.
It blows internal hashtable which is shared among all JVM parts (e.g. class loading).
The default capacity of string table is not large enough, and the number of buckets is constant.
It may increase GC pauses since JVM scans this internal hashtable and possibly rehashes it during stop-the-world phase.
More details in this presentation.

A regular HashMap or ConcurrentHashMap can be a on order of magnitude better for this task.

The following benchmark compares the performance of String.intern to [Concurrent]HashMap.putIfAbsent on the set of 1M strings:

@State(Scope.Benchmark)
public class Dedup {
    private static final HashMap<String, String> HM = new HashMap<>();
    private static final ConcurrentHashMap<String, String> CHM = new ConcurrentHashMap<>();

    private static final int SIZE = 1024 * 1024;
    private static final String[] STRINGS = new Random(0).ints(SIZE)
            .mapToObj(Integer::toString)
            .toArray(String[]::new);

    int idx;

    @Benchmark
    public String intern() {
        String s = nextString();
        return s.intern();
    }

    @Benchmark
    public String hashMap() {
        String s = nextString();
        String prev = HM.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    @Benchmark
    public String concurrentHashMap() {
        String s = nextString();
        String prev = CHM.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    private String nextString() {
        return STRINGS[++idx & (SIZE - 1)];
    }
}

The results on JDK 9 (smaller is better):

Benchmark                Mode  Cnt    Score    Error  Units
Dedup.concurrentHashMap  avgt   10   91,208 ±  0,569  ns/op
Dedup.hashMap            avgt   10   73,917 ±  0,602  ns/op
Dedup.intern             avgt   10  832,700 ± 73,402  ns/op

I guess, Guava's [Interners](https://google.github.io/guava/releases/23.0/api/docs/com/google/common/collect/Interners.html) are the right tool. AFAIK, the interning table does not resize at all, but there's `-XX:StringTableSize=1000003` to the resque. The default size on Java 8 is 60013, which explains the slowness. With a saner size, it might get usable. — maaartinus, Nov 12 '17 at 18:27

What is the performance impact of interning all strings in Java?

1 Answers1