
I am monitoring the performance and CPU usage of a large Java application using VisualVM. When I look at its memory profile, I see that the largest share of the heap (about 50%) is used by char arrays.

Following is a screenshot of the memory profile:

[Screenshot: VisualVM memory profile]

In the memory profile, at any given time I see roughly 9000 char[] objects.

The application accepts a large file as input. The file has roughly 80 lines, each consisting of 15-20 delimited config options. The application reads the file and stores these lines in an ArrayList of Strings. It then parses these strings to get the individual config options for each server.

The application also frequently logs each event to the console.

Java's String implementation uses a char[] internally, along with a reference to the array and three int fields.

From different posts on the internet, it seems that StringBuffer, StringBuilder, and String.intern() are more memory-efficient alternatives.

How do they compare to java.lang.String? Has anybody benchmarked them? If the application uses multithreading (which it does), are they a safe alternative?

davison
  • How many `String` objects do you have? – Sotirios Delimanolis Oct 10 '14 at 00:28
  • What are you actually using strings for? None of these is going to be "automatically" more efficient. We need more details. – Louis Wasserman Oct 10 '14 at 00:48
  • VisualVM can help you more: are you allocating lots of small arrays or a few big arrays? In either case, what incoming references are keeping the arrays alive? Then consider if your program actually uses all that data, or if some can be released earlier. – Jeffrey Bosboom Oct 10 '14 at 01:20
  • @JarrodRoberson How does the other question answer this one? I don't understand. – davison Oct 10 '14 at 02:04
  • Did you read the answer that currently has 301 upvotes and the comments on that answer, or the other answers and their comments? I will sum up: *From different posts on the internet it seems like StringBuffer , StringBuilder , String.intern() are more memory efficient data types. How do they compare to java.lang.String ? **Has anybody benchmarked them?*** **Yes**, many times, and implementations vary by vendor. –  Oct 10 '14 at 02:14
  • @JarrodRoberson *And have you read **this question**? It's titled **Optimizing heap usage**. There's no heap in the linked question, and when speaking about memory, they mean GC overhead.* This question is maybe a duplicate of something else, surely a confused one, but mostly unrelated to the linked one. The part you cited in bold is just that: a small part of the question. – maaartinus Oct 10 '14 at 03:18
  • `Strings` are `Strings` are `Strings` in Java. How they become instances of Strings is relatively unimportant because the `StringBuffer/StringBuilder` becomes eligible for collection when you release it. And garbage collection hinges on two things: are they reachable, and what is the memory pressure? [The rest is pretty much completely out of your control.](http://stackoverflow.com/a/11769540/177800) –  Oct 10 '14 at 03:35
  • If you want to optimize heap usage, make sure the `Strings` in your `List` get removed from the `List` and all references are released as soon as you are done with them. There is no deep profiling to this; that is Java 101. Unless you are getting an `OutOfMemoryError`, there is no problem here. Just look at the `Related` sidebar; this has been discussed ad nauseam on Stack Overflow. –  Oct 10 '14 at 03:38
  • Why are you reading the file into an ArrayList at all? Can't you process it line by line? – user207421 Oct 10 '14 at 07:47
  • @JarrodRoberson Sure, Strings are Strings, but for *rarely needed storage* you can use `byte[]` and save up to 50%, assuming you're using some non-exotic language. This week, I wrote such a thing (I'll probably throw it away, but that's a different story). Agreed that the builder and buffer solve nothing, but these were just the OP's ideas (the XY problem). I agree that there are many similar questions, but the one linked does not give the answer. – maaartinus Oct 10 '14 at 16:01
  • @maaartinus - you didn't read the part where these strings are then parsed further to get configuration details. So they have to be `String` to be parsed, by any sane definition, and raw byte arrays just increase your storage by 50% or more because you have to convert to a `String` anyway. I do Erlang, and representing textual content as byte arrays is extremely painful there, as it is everywhere else. –  Oct 10 '14 at 16:10
  • @JarrodRoberson I have read it all (and forgot most of it again). The OP stated that they need an `ArrayList`... and I can write such a list without storing a single string in it (it'd be slow because of the on-the-fly conversion and it'd produce tons of garbage, but it *would save half the memory* for English texts). – maaartinus Oct 10 '14 at 16:31
  • Define *very large*, your *very large* is probably not the same as someone that works with hundreds to thousands of terabytes of data at a time; and those people don't worry about how a string is represented internally in Java, this reeks of *Premature Optimization* –  Oct 11 '14 at 02:52

1 Answer


What I do is have one or more String pools. I do this to a) avoid creating new Strings when I already have one in the pool and b) reduce the retained memory size, sometimes by a factor of 3-5. You can write a simple string interner yourself, but I suggest you first consider how the data is read in to determine the optimal solution. This matters because you can easily make matters worse if you don't have an efficient solution.
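As a rough illustration of what a "simple string interner" could look like, here is a minimal sketch backed by a `HashMap`. The class name `SimpleStringPool` and its methods are hypothetical, not part of any library, and this version is not thread safe. It also requires a String to exist before the lookup, which is exactly the allocation the StringBuilder-based interner further down avoids.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal pool: keeps one canonical instance per distinct string value. */
final class SimpleStringPool {
    // Maps each distinct value to the first instance seen with that value.
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the pooled instance equal to s, adding s itself if the value is new. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct values currently held. */
    int size() {
        return pool.size();
    }
}
```

Duplicate values then share one instance, so only the first copy stays reachable through the pool and later copies can be collected.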

As EJP points out, processing a line at a time is more efficient, as is parsing each line as you read it. For example, an int or double takes up far less space than the same value held as a String (unless you have a very high rate of duplication).
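A minimal sketch of parsing each line as it is read, assuming a comma-delimited line of the form `host,port,timeout`. The question does not show the real file format, so the delimiter, the three fields, and the names `ServerConfig` and `ConfigLoader` are all illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parsed form of one config line; the fields are illustrative only. */
final class ServerConfig {
    final String host;
    final int port;              // a 4-byte int instead of a String such as "8080"
    final double timeoutSeconds; // an 8-byte double instead of a String such as "1.5"

    ServerConfig(String host, int port, double timeoutSeconds) {
        this.host = host;
        this.port = port;
        this.timeoutSeconds = timeoutSeconds;
    }
}

final class ConfigLoader {
    /** Reads one line at a time and keeps only the parsed values, never the raw lines. */
    static List<ServerConfig> load(Reader source) {
        List<ServerConfig> configs = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) continue; // skip blank lines
                String[] fields = line.split(",");   // assumed delimiter
                configs.add(new ServerConfig(
                        fields[0].trim(),
                        Integer.parseInt(fields[1].trim()),
                        Double.parseDouble(fields[2].trim())));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return configs;
    }
}
```

Each raw line becomes garbage as soon as its fields are parsed, so no list of unparsed Strings is retained.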


Here is an example of a StringInterner that takes a StringBuilder to avoid creating objects needlessly. You first populate a recycled StringBuilder with the text; if a String matching that text is in the interner, that String is returned, otherwise the result of the StringBuilder's toString() is stored and returned. The benefit is that you only create objects when you see a new String (or at least one not currently in the array). This can achieve an 80% to 99% hit rate and dramatically reduce memory consumption (and garbage) when loading many strings of data.

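// Assumption: @NotNull/@Nullable are the JetBrains annotations; any equivalent pair works.
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;

public class StringInterner {
    // Fixed-size table of cached Strings; a colliding entry simply replaces the old one.
    @NotNull
    private final String[] interner;
    // Table size minus one; usable as a bit mask because the size is a power of two.
    private final int mask;

    public StringInterner(int capacity) {
        int n = nextPower2(capacity, 128);
        interner = new String[n];
        mask = n - 1;
    }

    @NotNull
    public String intern(@NotNull CharSequence cs) {
        // Hash the characters directly so no String is needed for the lookup.
        long hash = 0;
        for (int i = 0; i < cs.length(); i++)
            hash = 57 * hash + cs.charAt(i);
        int h = hash(hash) & mask;
        String s = interner[h];
        if (isEqual(s, cs))
            return s; // hit: reuse the cached instance
        String s2 = cs.toString(); // miss: allocate once and cache it
        return interner[h] = s2;
    }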

    static boolean isEqual(@Nullable CharSequence s, @NotNull CharSequence cs) {
        if (s == null) return false;
        if (s.length() != cs.length()) return false;
        for (int i = 0; i < cs.length(); i++)
            if (s.charAt(i) != cs.charAt(i))
                return false;
        return true;
    }

    static int nextPower2(int n, int min) {
        if (n < min) return min;
        if ((n & (n - 1)) == 0) return n;
        int i = min;
        while (i < n) {
            i *= 2;
            if (i <= 0) return 1 << 30;
        }
        return i;
    }

    static int hash(long n) {
        n ^= (n >> 43) ^ (n >> 21);
        n ^= (n >> 15) ^ (n >> 7);
        return (int) n;
    }
}

This class is interesting in that it is not thread safe in the traditional sense, but it will work correctly when used concurrently. In fact, it might work more efficiently when multiple threads have different views of the contents of the array.
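To show how the recycled StringBuilder fits in, here is a usage sketch. So that it compiles on its own, it uses `TinyInterner`, a condensed, hypothetical restatement of the class above with a simplified hash and a fixed table of 128 slots; `InternerUsage.readFields` and the comma delimiter are also assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Condensed, simplified stand-in for the interner above (hypothetical). */
final class TinyInterner {
    private final String[] slots = new String[128]; // size must be a power of two

    String intern(CharSequence cs) {
        int h = 0;
        for (int i = 0; i < cs.length(); i++)
            h = 57 * h + cs.charAt(i);
        int index = (h ^ (h >>> 16)) & (slots.length - 1);
        String s = slots[index];           // the slot is read exactly once
        if (s != null && s.contentEquals(cs))
            return s;                      // hit: no allocation
        String created = cs.toString();    // miss: allocate and overwrite the slot
        slots[index] = created;
        return created;
    }
}

final class InternerUsage {
    /** Splits a line on ',' using one reused StringBuilder, interning every field. */
    static String[] readFields(String line, TinyInterner interner) {
        StringBuilder sb = new StringBuilder(); // recycled for every field
        List<String> fields = new ArrayList<>();
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == ',') {
                fields.add(interner.intern(sb)); // allocates only for unseen values
                sb.setLength(0);                 // reset without discarding the buffer
            } else {
                sb.append(line.charAt(i));
            }
        }
        return fields.toArray(new String[0]);
    }
}
```

Concurrent use is benign here because each call reads its slot once and Strings are immutable: a thread that sees a stale or overwritten slot only gets a cache miss and allocates a new, equal String.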

Peter Lawrey
  • Naive object pooling in Java is more harmful than anything in these situations. If someone is asking a basic question like this, they aren't going to get a self-implemented `String` pool correct, for sure. That is bad advice to someone asking a question like this, which demonstrates a lack of advanced understanding of how memory works in the different JVM implementations. Pools are a form of caching, and caching and concurrency are hard to get correct, even for "senior" Java people. The second paragraph is valid, but as you point out, it is just affirming what has already been said by someone else. –  Oct 10 '14 at 16:14
  • @JarrodRoberson I agree, you need experience in doing this, but you won't get that experience if you never try. BTW, a pool of immutable objects is surprisingly easy to implement concurrently. Pooling mutable objects is much harder to get right and efficient. – Peter Lawrey Oct 10 '14 at 16:22
  • @JarrodRoberson +1. The OP should first make clear what they need, at least to themselves. But there's no need for implementing [interners](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Interners.html) again. – maaartinus Oct 10 '14 at 16:38
  • @PeterLawrey I was just about to write how foolproof such pooling is, but noticed that it's easy to get it wrong... just use an unsynchronized `LinkedHashMap` and get an endless loop in `get`. – maaartinus Oct 10 '14 at 16:41
  • @maaartinus Using LinkedHashMap is likely to be a mistake, just because it can be more expensive than allocating a simple object. ;) E.g. you need a key you can look up, which for StringBuilder means creating a String with a char[], and adding it means creating a Map.Entry. – Peter Lawrey Oct 10 '14 at 16:58
  • @PeterLawrey Sure, but the Strings may be long or repeating a lot. Actually, Guava's `Interner` is backed by a `Map` and probably pretty expensive, too. Do you know something better? (I guess, I could write it myself... it's fun, but...) – maaartinus Oct 10 '14 at 17:12
  • @JarrodRoberson agree with peter. All of us are here to learn together. One has to begin somewhere. That's the spirit of this forum. – davison Oct 10 '14 at 18:13
  • In this particular case, this *solution* would just be overhead. Read the lines (line by line or all at once, really no different in this case as long as the raw lines are removed from the list), parse them into the configuration objects, and let the garbage collector do its work with the junk you produced. The parsed information is the only thing that is important or useful; anything else is just overcomplication. –  Oct 10 '14 at 18:54
  • @JarrodRoberson Having used this solution in a number of cases, I find it doesn't add overhead under heavy load. It may be an unnecessary complication, but that depends on the use case. – Peter Lawrey Oct 10 '14 at 21:27
  • For server configurations as given in the question, unless you are Google or Amazon, I find it hard to believe that line-based server configurations require a "solution" of any kind. Really, I load hundreds of gigabytes of text data, parse and process it in systems all the time on VMs with single-digit GB of RAM. This reeks of [Premature Optimization](http://c2.com/cgi/wiki?PrematureOptimization) –  Oct 11 '14 at 02:50
  • @jarrodroberson We are working on a system where this reduces the memory usage from 2.5 GB per thread to 1 GB per thread. This means that a 30 GB heap on a 32 GB machine can handle 30 threads of work instead of 12. It has 32 CPUs. – Peter Lawrey Oct 11 '14 at 08:43
  • You make my point for me: your problem domain isn't the OP's problem domain, by a few orders of magnitude. –  Oct 11 '14 at 14:52