1

I have a dictionary as a text file mapping from 2M words to 50k words. I load this file into memory as HashMap<String, String> by reading the file line by line, splitting on a separator and invoking myMap.put(line[0], line[1]). The size of the text file is 45MB, while the HashMap uses 350MB of the heap. My goal is to reduce memory usage without harming lookup speed. myMap.values().size() returns 2M instead of 50k, suggesting that the values are stored as duplicates. Is there a way to make identical values point to the same String object?

Map<String, String> dict = new HashMap<>();
try (FileReader fr = new FileReader(FILE);
        BufferedReader br = new BufferedReader(fr)) {
    String line;
    while ((line = br.readLine()) != null) {
        String key_value[] = line.split(":");
        dict.put(key_value[0], key_value[1].intern());
    }
} catch (Exception e) {
    e.printStackTrace();
}
mossaab
  • 1,812
  • 4
  • 23
  • 44
  • 5
    If you have 2M unique words that are mapped to 50k (non unique) words, then you hashmap's size will be 2M. – assylias Jul 10 '13 at 15:30
  • 1
    The hashmaps size is based on the entries therefore the number of keys. Regarding the duplicate values: The JVM does some optimization with string values. As a string is immutable it often uses the same object for equal strings. You can't rely on that but probably your strings are already not duplicated. – André Stannek Jul 10 '13 at 15:32
  • @assylias I know. My question is how to avoid storing duplicate values. That is allowing multiple keys to point to map to the same object value. – mossaab Jul 10 '13 at 15:33
  • @stonedsquirrel. I have already verified that I have 50k values. So there are a lot of duplicated values. – mossaab Jul 10 '13 at 15:34
  • Yes, because you have 2M keys. But if a key points to an equals string as another key, it is highly likely that they are pointing to the same string object. – André Stannek Jul 10 '13 at 15:35
  • @stonedsquirrel You meant highly unlikely? If the strings are read from a file, they are all different instances. – assylias Jul 10 '13 at 15:37
  • you have to try if they are interned.. – nachokk Jul 10 '13 at 15:38
  • Have a look at this post http://stackoverflow.com/questions/5329358/treemap-or-hashmap – boxed__l Jul 10 '13 at 15:49
  • You could try to size your map appropriately: `Map dict = new HashMap<>(2_000_000, 1.0f);`. – assylias Jul 10 '13 at 16:06
  • See also: http://stackoverflow.com/questions/231051/is-there-a-memory-efficient-replacement-of-java-lang-string – assylias Jul 10 '13 at 16:08

2 Answers2

5

Regardless of whether or not duplicates point to the same objects, there will still need to be references to these objects, so size should still return the size with duplicates included.

A simple example showing this.

If you want duplicates to point to the same objects, you'll have to do this outside of the HashMap or hope the optimizer takes care of it.

Alternatives to String.intern() as joe776 suggested are possibly with a self-written collection extending some Set (since Set doesn't have a Object get(Object) method) or another HashMap (having objects point to themselves) that allows you to get a reference to the common object.

Community
  • 1
  • 1
Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
2

You can use String.intern() on the values to make them all point to the same instance. But this has other problems like using the PermGenSpace, which is not garbage collected pre-Java 1.7. You would call it like this: myMap.put(line[0], line[1].intern()).

Maybe a map based on a trie is more efficient, but I haven't used that, yet. Also depends on the nature of your Strings. The more alike your keys are, the more space the trie can save.

http://code.google.com/p/trie-map/

Also see Dukeling's answer concerning keys().size() and values().size() and the use of another map to avoid duplicate values.

Community
  • 1
  • 1
joe776
  • 1,106
  • 14
  • 23
  • I'm on Java 1.7, and have just tried `line[1].intern()`. `myMap.values().size()` still returns `2M`, and memory usage remains the same. I'll try `trie` if no canonical solutions are provided. – mossaab Jul 10 '13 at 15:41
  • 2
    +1 An aternative is to have a `Map` where the key and value are the same. You can lookup the value to see if it has been used before and reuse the same String object. This "interner" map can be discarded when you finish. – Peter Lawrey Jul 10 '13 at 15:42
  • 1
    @mossaab `myMap.values().size()` will *always* return 2M if there are 2M keys. – assylias Jul 10 '13 at 15:42
  • @mossaab The 2M keys won't change. That's the actual number you have. To check for the number of different values you could try something like `new HashSet(myMap.values()).size()`. This should give you 50k. – joe776 Jul 10 '13 at 15:44
  • @assylias I now realize that `myMap.values().size()` will always return 2M. But as memory usage remains the same, this means that either Java already does not store duplicate values, or `String.intern()` does not do what I need to. – mossaab Jul 10 '13 at 15:46
  • @mossaab Or you don't use it properly. You should (i) give a sample of your input and (ii) the code you use to create the map. That should help you get more specific answers. – assylias Jul 10 '13 at 15:47
  • @joe776. Indeed `new HashSet(myMap.values()).size()` returns 50k. But this doesn't imply that `myMap.values()` doesn't contain duplicates. – mossaab Jul 10 '13 at 15:48
  • @mossaab The JVM sometimes calls intern() on its own. If it does that in your case, calling it manually will not make diffenrece, unfortunately. As the String handling changed quite a bit in Java 7 (e.g. String no longer inside the PermGen) this is somewhat likely. – joe776 Jul 10 '13 at 15:49
  • @mossaab As intern() ensures, that equal Strings all use the same instance, there are no duplicates possible. – joe776 Jul 10 '13 at 15:51
  • @mossaab `values().size() == 2M` is still OK, as it is only a collection and might have 2M times the identical String. To really check for duplicates you should iterate over this collection yourself and compare the elements with `==`. – joe776 Jul 10 '13 at 15:58
  • @joe776 I've tried to generate random values, and memory usage increased. It seems like Java is already doing `String.intern()`, and I don't have a way to reduce memory usage. I'm accepting this answer. Thanks! – mossaab Jul 10 '13 at 16:04
  • @mossaab Thanks! Would be nice to hear if the trie made any difference in memory usage. – joe776 Jul 10 '13 at 16:12