9

I have a large number of name - value pairs (approx 100k) that I need to store in some sort of cache (say a hash map) where the value is a string with an average of about 30k bytes in size.

Now I know for a fact that a large number of the values have exactly the same string data. In order to avoid having to allocate the identical string data several times, I would like to somehow reuse a previously allocated string thus consuming less memory. In addition this needs to be reasonably fast. i.e. scanning through all the previously allocated values one-by-one is not an option.

Any recommendations on how I could solve this problem?

Kishnan
  • 659
  • 2
  • 8
  • 16

7 Answers7

10

Do not use String.intern (there have been various memory issues related to this through the years). instead, create your own cache, similar to String.intern. basically, you want a Map, where each key maps to itself. then, before caching any string, you "intern" it:

private Map<String,WeakReference<String>> myInternMap = new WeakHashMap<String,,WeakReference<String>>();
public String intern(String value) {
  synchronized(myInternMap) {
    WeakReference<String> curRef = myInternMap.get(value);
    String curValue = ((curRef != null) ? curRef.get() : null);
    if(curValue != null) {
      return curValue;
    }

    myInternMap.put(value, new WeakReference<String>(value));
    return value;
  }
}

note, you use weakreferences for the keys and values so that you don't keep references for strings which you are no longer using.

james
  • 1,379
  • 7
  • 6
  • 2
    No, this is very BAD advice. Most such comments refer to rather old issues for now obsolete JVMs. There is absolutely nothing wrong with String.intern() for long-living shared Strings. Much less than issues with roll-your-own replacements. – StaxMan Apr 16 '09 at 18:22
  • …especially roll-your-own replacements that recommend WeakReferences. Never use WeakReferences for a cache, they cause grotesque cache thrashing. If you are writing your own cache, you should be using SoftReference's. – Recurse Nov 15 '10 at 01:03
  • The size of the permanent generation, which is used for intern()ed strings (at least by HotSpot) is still limited. As long as you tune it, that's OK: (e.g. http://jira.codehaus.org/browse/XSTR-395). @Recurse: What is needed is not really a cache, so your comment about SoftRefs does not apply. If you cache disk data, you want SoftRefs, to avoid throwing away data which is not being used. Here you want the opposite, i.e. a weak object pool: http://www.codeinstructions.com/2008/09/instance-pools-with-weakhashmap.html http://www.javamex.com/tutorials/memory/string_saving_memory.shtml – Blaisorblade Nov 20 '10 at 12:19
9

String.intern() will help you here (most likely). It will resolve multiple instances of the same string down to one copy.

EDIT: I suggested this would 'most likely' help. In what scenarios will it not ? Interning strings will have the effect of storing those interned string representations permanently. If the problem domain is a one-shot process, this may not be an issue. If it's a long running process (such as a web app) then you may well have a problem.

I would hesitate to say never use interning (I would hesistate to say never do anything). However there are scenarios where it's not ideal.

Brian Agnew
  • 268,207
  • 37
  • 334
  • 440
  • String.intern can be quite slow. It also places the String into the permanent generation, which could well cause GC performance problems. – Tom Hawtin - tackline Apr 07 '09 at 13:52
  • The permanent generation is an issue, granted. The question doesn't have the context in which this is to be used. If it's a standalone app, then it may well be ok. Otherwise (say a ong running web app), then not. As ever, solutions need to be evaluated in the context of where they'll be used. – Brian Agnew Apr 07 '09 at 14:08
  • @Brian Agnew: My I suggest you edit and expand your answer then to include context? Comments don't count, if you get my drift. – Stu Thompson Apr 07 '09 at 14:17
4

String.intern is the obvious choice as Brian says. But if you don't want to intern across all the String in memory, you can use a Set to first see if the value is present. Here's untested code. You will have to work out removing from reverse map when removing from main

  class Map2<K, V> implements Map<K, V>
  {
    Map<K, V> _map = Maps.newHashMap();
    Set<V, V> _rev = Maps.newHashMap();

    V put(K k, V v) {
      if (_rev.containsKey(v)) {
        V prev = _rev.get(v);
        return _map.put(k, prev);
      } else {
        _rev.put(v, v);
        return _map.put(k,v);
      }
   }
Miserable Variable
  • 28,432
  • 15
  • 72
  • 133
  • ConcurrentMap has putIfAbsent, which might be useful. – Tom Hawtin - tackline Apr 07 '09 at 13:55
  • I like this solution, it's no overkill with weak references etc. To optimize even more on storage, one could just search the existing values in the Map, given that the total number is small (say <10000). Upvote! – Ingo Apr 08 '09 at 07:48
  • @Ingo: searching through even 1000 values rather than performing a lookup is a bad idea. The original question talks about 100k name - value pairs. – Blaisorblade Nov 20 '10 at 12:29
  • Weak references are crucial. This solution will leak memory when removing a mapping (k, oldv), even by replacing it with (k, newv), because _rev keeps a ref to oldv. And you can't remove oldv, because it can be needed by another _map entry - you'd need to count such references, which is complex and creates overhead. Some use cases might be safe, but relying on that is fragile, especially here, since implementing the Map interfaces allows using it in all scenarios. Moreover, such a solution might be used only if values are equal - the question only guarantees that they share equal Strings. – Blaisorblade Nov 20 '10 at 12:44
1

It somewhat depends upon how you are creating the String.

One possible way is to use TreeSet that uses a Comparator that can compare existing Strings and the source of your new String. Use SortedSet.tailSet and an Iterator to find an existing String. Or alternatively NavigableSet.ceiling/floor or a TreeMap with a similar setup.

String.intern has performance problems.

Tom Hawtin - tackline
  • 145,806
  • 30
  • 211
  • 305
1

Agreed with others on not using String.intern(): once you've put a string there, it will never go away. Look to early revisions of Xerces for why this is a bad idea.

A better solution is to use a WeakHashMap, wrapping the value in a WeakReference:

private Map<String,WeakReference<String>> _map 
    = new WeakHashMap<String,WeakReference<String>>();

public synchronized String intern(String str)
{
    WeakReference<String> ref = _map.get(str);
    String s2 = (ref != null) ? ref.get() : null;
    if (s2 != null)
        return s2;
    str = new String(str);
    _map.put(str, new WeakReference(str));
    return str;
}

This code is from an article that I wrote on the Java reference objects. You'll find the explanation there.

EDIT: need to create a new string here (and I'll update the article) because the original might be a substring of a far larger character array. I thought that was fixed around JDK 1.3, but apparently not (at least not in 1.5).

kdgregory
  • 38,754
  • 10
  • 77
  • 102
  • Interning a string will not mean that it will 'never go away', you can garbage collect the perm gen though it may not be so efficient it can and will get garbage collected if there are no strong references to it. – Henry B Aug 27 '09 at 14:59
  • The permgen, at least in the Sun JVM, is managed separately from the rest of the heap. If you can point to code that removes strings from the intern table, then I'm willing to retract my statement. – kdgregory Aug 27 '09 at 18:53
0

You could compress the strings. A 30K string should get a good compression ratio. I wrote a hack to compress large String as an exercise, but you could use a byte[] of the compressed data to store the String.

A 30K character string will use about 60KB (2 bytes per character) so even using getBytes() is likely to be an improvement.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
0

Do you actually need Strings, or do you just need any old CharSequence? If not, then consider implementing a "compact" CharSequence such as the one I suggest in the link.

Neil Coffey
  • 21,615
  • 7
  • 62
  • 83