2

I am working on a trade processing application where I have to deal with a lot of strings. Some of those strings are non-repeating such as a Trade ID whereas others repeat frequently such as Product ID.

I am considering interning all trade attributes as a generic step while parsing the Trade message (JSON) to reduce the memory usage and speed up equality checks.

My question is whether I might unintentionally degrade performance with this move?

Rajesh Kolappakam
  • 2,095
  • 14
  • 12
  • 5
    Are you at that point of tuning your application that this is a valid issue, or is this a premature micro-optimization that you're thinking of doing without a good reason? Measure, measure, measure... – Kayaman Nov 12 '17 at 05:54
  • I don't think that interning can speed up the parsing as every string must be extracted from the input JSON first and then interned. So, I guess, that parsing will be slower, but it may help later. There is [a JSON parser](https://github.com/square/moshi/blob/master/moshi/src/main/java/com/squareup/moshi/ClassJsonAdapter.java#L158) which works with bytes and matches the known attribute names without producing the strings. – maaartinus Nov 12 '17 at 18:37

1 Answers1

3

Deduplicating common strings is usually a good idea to save memory.
But never use String.intern for deduplication!

  • String.intern is a native method; each call suffers from additional JNI overhead.
  • It blows internal hashtable which is shared among all JVM parts (e.g. class loading).
  • The default capacity of string table is not large enough, and the number of buckets is constant.
  • It may increase GC pauses since JVM scans this internal hashtable and possibly rehashes it during stop-the-world phase.
  • More details in this presentation.

A regular HashMap or ConcurrentHashMap can be a on order of magnitude better for this task.

The following benchmark compares the performance of String.intern to [Concurrent]HashMap.putIfAbsent on the set of 1M strings:

@State(Scope.Benchmark)
public class Dedup {
    private static final HashMap<String, String> HM = new HashMap<>();
    private static final ConcurrentHashMap<String, String> CHM = new ConcurrentHashMap<>();

    private static final int SIZE = 1024 * 1024;
    private static final String[] STRINGS = new Random(0).ints(SIZE)
            .mapToObj(Integer::toString)
            .toArray(String[]::new);

    int idx;

    @Benchmark
    public String intern() {
        String s = nextString();
        return s.intern();
    }

    @Benchmark
    public String hashMap() {
        String s = nextString();
        String prev = HM.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    @Benchmark
    public String concurrentHashMap() {
        String s = nextString();
        String prev = CHM.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    private String nextString() {
        return STRINGS[++idx & (SIZE - 1)];
    }
}

The results on JDK 9 (smaller is better):

Benchmark                Mode  Cnt    Score    Error  Units
Dedup.concurrentHashMap  avgt   10   91,208 ±  0,569  ns/op
Dedup.hashMap            avgt   10   73,917 ±  0,602  ns/op
Dedup.intern             avgt   10  832,700 ± 73,402  ns/op
apangin
  • 92,924
  • 10
  • 193
  • 247
  • I guess, Guava's [Interners](https://google.github.io/guava/releases/23.0/api/docs/com/google/common/collect/Interners.html) are the right tool. AFAIK, the interning table does not resize at all, but there's `-XX:StringTableSize=1000003` to the resque. The default size on Java 8 is 60013, which explains the slowness. With a saner size, it might get usable. – maaartinus Nov 12 '17 at 18:27
  • Right, JVM may rehash, but not resize the string table. – apangin Nov 12 '17 at 19:05
  • @apangin, thank you very much for the detailed explanation – Rajesh Kolappakam Nov 15 '17 at 13:30