Java 8, memory wasted by duplicate strings

Question

I'm investigating a memory leak in my Grails 3.3.10 server that is running on a Java 8 JVM. I took a heap dump from a production server that was running low on memory and analysed it with JXRay. The html report says that some memory is wasted on duplicate strings with 19.6% overhead. Most of it is wasted on duplicates of the empty string "" and it is mostly coming from database reads. I have two questions about this.

Should I start interning strings or is it too costly of an operation to be worth it?
Quite a bit of my code deals with deeply nested JSON structures from elasticsearch and I didn't like the fragility of the code so I made a small helper class to avoid typos when accessing data from the json.

public static final class S {
    public static final String author      = "author";
    public static final String aspectRatio = "aspectRatio";
    public static final String userId      = "userId";
    ... etc etc

That helps me avoid typos like so:

    Integer userId = json.get("userid"); // Notice the lower case i. This returns null and fails silently
    Integer userId = json.get(S.userId); // If I make a typo here the compiler will tell me.

I was reasonably happy about this, but now I'm second guessing myself. Is this a bad idea for some reason? I haven't seen anyone else do this. That shouldn't cause any duplicate strings to be created because they are created once and then referenced in my parsing code, right?

If your goal is to bring down memory consumption, then yes, maybe you should be checking your strings, like `if (str.length == 0) str = ""; //this will use the intern version` — ControlAltDel, Oct 24 '20 at 12:13

Edwin Buck · Answer 1 · 2020-10-24T15:30:30.567

The problem with a String holding class is that you are using a language against its language design.

Classes are supposed to introduce types. A type that provides no utility, because it's an "Everything that can be said with a string" type is rarely useful. While there are some patterns of this occurring in many programs, typically they introduce more behavior than "all the stuff is here." For example, locale databases provide replacement strings for different languages.

I'd start by carving out the sensible enumerations. Error messages might easily be converted into enums, which have easy auto-convert string representations. That way you get your "typo detection" and a classification built-in.

 DiskErrors.DISK_NOT_FOUND
 Prompts.ASK_USER_NAME
 Prompts.ASK_USER_PASSWORD

The side-effect of changes like this can hit your desired goal; but beware, these kinds of changes often signal the loss of readability.

Readability isn't what you think is easy to read, it's what a person who has never used the code would think is easy to read.

If I were to see a problem with "Your selected hard drive was not found", then I'd look through the code base for a string "Your selected hard drive was not found". That could land me in two places:

In the block of code were the error message was raised.
In a table mapping that string to a name.
In many blocks of code where the same error message is raised.

With the table mapping, I can then do a second search, searching for where the name is used. That can land me with a few scenarios:

It is used in one place.
It is used in many places.

With one place, a kind of code maintenance problem arises. You now have a constant that is not used by any other part of the code maintained in a place that is not near where it is used. This means that to do any change that requires full understanding of the impact, someone has to keep the remote constant's value in mind to know if the logical change should be combined with an updated error message. It's not the updating of the error message that causes the extra chance for error, it's the fact that it is removed from the code being worked on.

With multiple places, I have to cycle through all of matches, which basically is the same effort as the multiple string matches in the first step. So, the table doesn't help me find the source of the error, it just adds extra steps that are not relevant to fixing the issue.

Now the table does have a distinct benefit in one scenario: When all the messages for a specific kind of issue should be updated at the same time. The problem is, that such a scenario is rare, and unlikely to happen. What is more likely to happen is that an error message is not specific enough for a certain scenario; but, after another "scan of all the places it is used" is correct for other scenarios. So the error message is split, instead of updated in place, because the coupling enforced by the lookup table means one cannot modify some of the error messages without creating a new error message.

Problems like this come from developers slipping in features that appeal to developers. In your case, you're building in an anti-typo system. Let me offer a better solution; because typos are real, and a real problem too.

Write a unit test to capture the expected output. It is rare that you will write the same typo twice, exactly the same way. Yes, it is possible, but coordinated typos will impact both systems the same. If you introduce a spelling error in your lookup table, and introduce it in the usage, the benefit would be a working program, but it would be hard to call it a quality solution (because the typos weren't protected against and are there in duplicate).

Have your code reviewed before submitting it to a build system. Reviews can get out of hand, especially with inflexible reviewers, but a good review should comment on "you spelled this wrong." If possible review the code as a team, so you can point out your ideas as they make their comments. If you have difficultly working with people (or they have difficulty working with people) you will find peer-review hard. I'm sorry if that happens, but if you can get a good peer review, it's the second "best" defense against these issues.

Sorry for the length of this reply, but I hope this gives you a chance to remember to "step back" from a solution and see how it impacts your future actions with the code.

And as for the "" String, focusing on why it is being set would probably be more effective in building a better product than patching the issue with interning (but I don't have access to your code base, so I might be wrong!)

Good luck

Stephen C · Answer 2 · 2022-06-13T10:08:03.993

Q1: Should I start interning strings or is it too costly of an operation to be worth it?

It is hard to say without more information about how the strings are being created and their typical lifetime, but the general answer is No. It is generally not worth it.

(And interning won't fix your memory leak.)

Here are some of the reasons (a bit hand-wavey I'm afraid):

Interning a String doesn't prevent the string you are interning from being created. Your code still needs to create it and the GC still needs to collect it.
There is a hidden data structure that organizes the interned strings. That uses memory. It also costs CPU to check to see if a string is in the interning data structure and add it if needed.
The GC needs to do special (weak reference like) things with the interning data structure to prevent it from leaking. That is an overhead.
An interned string tends to live longer than a non-interned string. It is more likely to be tenured to the "old" heap, which leads to its lifetime extended even longer ... because the "old" heap is GC'ed less often.

If you are using the G1 collector AND the duplicate strings are typically long lived, you might want to try enabling G1GC string deduplication (see here). Otherwise, you are probably better off just letting the GC deal with the strings. The Java GC's are designed to efficiently deal with with lots of objects (such as strings) being created and thrown away soon after.

If it is your code that is creating the Java strings, then it might be worth tweaking it to avoid creating new zero length strings. Manually interning the zero length strings as per @ControlAltDel's comment is probably not worth the effort.

Finally, if you are going to try to reduce the duplication one way or another, I would strongly advise that you set things up so that you can measure the effects of the optimization:

Do you actually save memory?
Does this affect the rate of GC runs?
Does this affect GC pauses?
Does it affect request times / throughput?

If the measurements say that the optimization hasn't helped, you need to back it out.

Q2: Is this a bad idea for some reason? That shouldn't cause any duplicate strings to be created because they are created once and then referenced in my parsing code, right?

I can't think of any reason not to do that. It certainly doesn't lead directly to creating of duplicate strings.

On the other hand, you won't reduce string duplication simply by doing that. String objects that represent literals get interned automatically.

Hey @Stephen, i want to enable this String deduplication feature . Will it cause any error in my code if the strings are shared? — Asgar, Jun 12 '23 at 13:35
For example, if your code (incorrectly) compares strings using `==`, or if it uses nasty reflection to break string immutability ... then string deduplication is liable to *change* the way that your code behaves. But the real solution is not to do that kind of thing. — Stephen C, Jun 12 '23 at 13:55

Java 8, memory wasted by duplicate strings

2 Answers2