14

In my code, which often runs on servers I do not control the configuration of, I have collection of users and each user has a byte[] array.

Sometimes these byte[] arrays are unique to the user. Often, though, there will be large numbers users with the exact same byte[] array.

I am trying to reduce the RAM consumption of my server.

I've tried turning my byte[] arrays into strings and interning them, but then I often run into PERM-GEN out-of-memory errors. I also see a significant performance degradation with the encoding/decoding when I want to access the byte[] array for a user, and I see much increased worse-case memory usage - presuambly strings are much larger than arrays.

How can I have a Set<SoftReference<byte[]>> lookup when Java arrays are not hashable and SoftReferences do not wrap the hash of the object the point at either. A Map<byte[],SoftReference<byte[]>> is obviously also defeating itself because the key is itself and prevents collection; and Set is internally implemented in terms of Mapanyway.

So how can I intern byte[] arrays?

Will
  • 73,905
  • 40
  • 169
  • 246
  • One thing that comes to mind is the Flyweight pattern. Also have a look at http://stackoverflow.com/questions/1058149/using-a-byte-array-as-hashmap-key-java – Arno Mittelbach Jun 12 '13 at 12:24
  • I think you'll have to wrap your byte arrays, effectively boxing them e.g. with `new ByteArray(byte [] theBytes)` and never make additional long-lived references to `theBytes` outside of the `ByteArray` itself. Then softreferences to the `ByteArray`s will work correctly. You might look at `WeakHashMap` for your application as well. – Gene Jun 12 '13 at 12:40
  • How large are these byte arrays? How does the user get hold of one? – fge Jun 12 '13 at 12:48
  • @fge they arrive over the network, and are 'room' keys - variable length, sometimes short, sometimes 1K or more - on another server system. Sometimes lots of users are in the same 'room', other times they have their own and so on. – Will Jun 12 '13 at 12:50

3 Answers3

5

If you effectively have many identical arrays, use an HashSet<ByteBuffer> as a cache. You can get the ByteBuffer array with method array() and the ByteBuffer class has hashCode and equals methods. Of course it is better if your arrays are immutable.

EDIT2 The comment from @Will is exact, to be able to get back the array, use a WeakHashMap<ByteBuffer,WeakReference<ByteBuffer>> and do something like that :

public byte[] internalize(byte[] bytes) {
 ByteBuffer wrapped = ByteBuffer.wrap(bytes);
 if(cache.containsKey(wrapped)) {
  wrapped = cache.get(wrapped).get();
 }
 else {
  cache.put(wrapped, new WeakReference<ByteBuffer>(wrapped);
 }
 return wrapped.array();
}
gma
  • 2,563
  • 15
  • 14
  • How do I get them collected from the cache when the system needs memory and there are no users at that time referencing them? – Will Jun 12 '13 at 12:29
  • 1
    A `WeakHashMap` might communicate the intent even more clearly. You could store nulls as the values. – Tom Anderson Jun 12 '13 at 12:38
  • Although if you use a `WeakHashMap`, then you can wrap it using [`Collections#newSetFromMap`](http://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#newSetFromMap%28java.util.Map%29) to get a `Set`, which leaves your code cleaner. – Tom Anderson Jun 12 '13 at 12:41
  • @Oak: My gripe with `Map` is that it lets an element be in one of three states: not in the map, mapped to true, and mapped to false. We only need two states, so there is redundancy. That doesn't present a practical problem, but it is inelegant. `Void` only has one possible value (null), so there are only two possible states. – Tom Anderson Jun 12 '13 at 12:45
  • @TomAnderson how can you get the key value itself from a `Map` in O(1)? We need to retrieve what we are caching? – Will Jun 12 '13 at 12:45
  • @Will: I think you want to direct that question at gma, not me! – Tom Anderson Jun 12 '13 at 12:47
  • I am now profiling this version. I think you can save a few bytes with a custom `ByteArray` class as you want to keep the computed hashCode but you don't need the read and write markers. Also, would `WeakHashMap` where each `ByteArray` key points at the exact same `byte[]` reduce the `SoftReference` cost at GC time while keeping the array collectable when its unused? – Will Jun 13 '13 at 05:38
  • The purpose of the cache is precisely to avoid having the same `byte[]` more than one time, so why having many `ByteArray` referencing the same `byte[]`?. Secondly, to permit the GC when unused, you will have to wrap the `byte[]` value into a `WeakReference`, prefer the Weak to a Soft because soft references may be or may be not garbage collected, depending on heap occupation. The WeakReference will be collected as soon as the array is no more referenced. – gma Jun 13 '13 at 07:16
2

I've tried turning my byte[] arrays into strings and interning them, but then I often run into PERM-GEN out-of-memory errors.

I agree that you need something like String.intern(), but the standard implementation is native, so not much joy.

You could have a Map<Integer,Collection<SoftReference<byte[]>>>, using the hash code of the byte arrays as the Map key. Your intern method could then look up the set of existing byte arrays with the same has code as the given byte arrays. With a good hash code that should give a small set of arrays to examine for a match.


Edit: To clarify:

Something like this:

 class ByteArrayCache
 {
      private final Map<Integer,Collection<SoftReference<byte[]>> map = new ...;

      public final byte[] intern(byte[] byteArray)
      {
           final int hash = Arrays.hashCode(byteArray);
           final Collection<SoftReference<byte[]>> arrays = map.get(hash);
           if (arrays != null) {
              // Search through arrays for a match, and return the match.
              // If no match found, add byteArray to the collection and return it
           } else {
              // create a new map entry, add byteArray to it, and return byte array
           }
      }
 }
Raedwald
  • 46,613
  • 43
  • 151
  • 237
  • How can I vacuum up orphaned `Integer` and `Set` when they are not used? And what's the RAM overhead of so much boxing in the worst case when there isn't a lot of duplication? – Will Jun 12 '13 at 12:31
  • If the arrays are large and sparse, first bytes of md5 may be a useful hash. If they are small, roll-you-own will probably suffice. – tucuxi Jun 12 '13 at 12:33
  • @tucuxi `Arrays.hashCode(byte[])` to the rescue :) But the fundemental problem is you can't subclass `byte[]` to patch that in – Will Jun 12 '13 at 12:34
  • @Will Use a HashSet instead? Avoids boxing the hash key; but you *would* need to wrap those byte arrays, to overwrite hashCode(). – tucuxi Jun 12 '13 at 12:35
  • No need to override hashCode(). It's a Map, not a Set. – Raedwald Jun 12 '13 at 12:44
1

I would implement a cache based on Guava weak values map. It guarantees that if theres no more strong references to byte array the entry will be automatically removed.

class Cache {
    private final ConcurrentMap<Key, byte[]> map = new MapMaker().weakValues().makeMap();

    private static class Key {
        byte[] a;
        int hash;

        Key(byte[] a) {
            this.a = a;
            hash = Arrays.hashCode(a);
        }

        @Override
        public int hashCode() {
            return hash;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof Key) {
                return Arrays.equals(a, ((Key) obj).a);
            }
            return false;
        }
    }

    public byte[] intern(byte[] a) {
        byte[] a1 = map.putIfAbsent(new Key(a), a);
        if (a1 != null) {
            return a1; 
        }
        return a;
    }
}
Evgeniy Dorofeev
  • 133,369
  • 30
  • 199
  • 275