40

I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but as I'm memory constrained and need to store 2 million pairs with 7-character keys, I need something memory efficient; retrieval speed is a secondary concern.

Currently I'm going along the line of:

List<Tuple<String, int>> list = new ArrayList<Tuple<String, int>>();
list.add(...); // load from file
Collections.sort(list);

and then for retrieval:

Collections.binarySearch(list, key); // log(n), acceptable

Should I perhaps go for a custom tree (each node a single character, each leaf with result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.

Peter O.
skolima
  • Try to follow the principle of doing the easiest thing, unless tests show it's too slow. Your binary search method should work, but have you tested HashMap memory consumption? – Dave Oct 13 '11 at 14:59
  • I did - and with `HashMap`, the program sometimes explodes with `OutOfMemoryException` when reloading the data. I'm hoping I can avoid this with a better choice of data structure. – skolima Oct 13 '11 at 15:07
  • Use TreeMap. You have the memory benefit of a binary search tree with the programming convenience of a map lookup. Look at my reply below. – Jagat Oct 13 '11 at 15:13
  • Is your whole system running out of memory or only your Java VM? Is this something that will be done by a lot of 'users' or will this run on a dedicated machine? Simply adding a few gigs of mem might be the cheapest ('easiest') solution. – Jochem Oct 13 '11 at 18:50
  • Also, does this list change a lot? Or is it actually baked only once (in a while)? Is it sorted upfront? Depending on these answers, you could think of flattening the whole thing to disk and run a binary search with file-seeks. I did it once, worked quite well. It's quite on the far end of the speed-memory trade off, but that was basically what you asked for. – Jochem Oct 13 '11 at 18:55
  • The node-per-character tree is better known as a "trie." – fluffy Oct 13 '11 at 18:59

16 Answers

58

Edit: I just saw you mentioned the Strings were UK postcodes, so I'm fairly confident you couldn't go very wrong by using a Trove TLongIntHashMap (btw, Trove is a small library and it's very easy to use).

Edit 2: Lots of people seem to find this answer interesting so I'm adding some information to it.

The goal here is to use a map containing keys/values in a memory-efficient way so we'll start by looking for memory-efficient collections.

The following SO question is related (but far from identical to this one).

What is the most efficient Java Collections library?

Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) comparing the memory use and speed of Trove against the default Collections. Here's a snippet:

                      100000 put operations      100000 contains operations 
java collections             1938 ms                        203 ms
trove                         234 ms                        125 ms
pcj                           516 ms                         94 ms

And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:

java collections        oscillates between 6644536 and 7168840 bytes
trove                                      1853296 bytes
pcj                                        1866112 bytes

So even though benchmarks always need to be taken with a grain of salt, it's pretty obvious that Trove will not only save memory but will also be much faster.

So our goal now becomes to use Trove (seeing that by putting millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).

You mentioned 2 million pairs with 7-character keys and a String/int mapping.

2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String, Integer>, which is why Trove makes a lot of sense here.

However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using, say, only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge object creation altogether and represent your 7 characters in a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead entirely.

You stated specifically that your keys were 7 characters long and then commented they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of key/value pairs into memory using Trove.

The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.

(*) say you only have at most 256 codepoints/characters in use; then each key fits in 7*8 == 56 bits, which is small enough to fit in a long.

Sample method for encoding the String keys into longs (assuming ASCII characters, one byte per character for simplicity - 7 bits would be enough):

long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IndexOutOfBoundsException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        result += ((long) ((byte) key.charAt(i))) << i * 8;
    }
    return result;
}
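
A comment below asks for the inverse operation; here is a minimal decode(...) sketch (my addition, not from the original answer), assuming the same one-byte-per-character ASCII packing as encode above:

```java
String decode(final long encoded) {
    final StringBuilder sb = new StringBuilder(8);
    long remaining = encoded;
    // Unpack one byte per character, in the same order encode() packed them.
    // Assumes no character encodes to zero, which holds for ASCII keys.
    for (int i = 0; i < 8 && remaining != 0; i++) {
        sb.append((char) (remaining & 0xFF));
        remaining >>>= 8;
    }
    return sb.toString();
}
```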
TacticalCoder
  • So just to dumb this down a notch, you essentially wind up with two sets of data to work with: the set which maps from the String keys to a long key, then the actual data set which uses the long key and the int data. That is still searchable, because it's trivial to map the String key to a long key, but it's more memory efficient because longs and ints have less overhead as primitives. Is that close? – Freiheit Oct 13 '11 at 17:02
  • @Adamski: thanks ; ) Keep in mind that you don't just need to use Long/Integer: you also **must** use a library like Trove if you want a hashmap backed by primitives and primitives only. This way you'll save a lot of memory. – TacticalCoder Oct 13 '11 at 17:27
  • @Freiheit: yup that is basically it. But you don't really need a "set of data" to do the String <---> int mapping: you can just use a method *postcodeToInt()* and another method *intToPostcode()*. Trove is really, much, much more efficient because it's a library backed by primitives only, specifically for this type of use (millions of entries etc.). And it's really easy to use: it's basically one *.jar* to add to your project. – TacticalCoder Oct 13 '11 at 17:30
  • @skolima: thanks for adding the *encode(...)* method. Now someone should add the *decode(...)* : ) – TacticalCoder Oct 14 '11 at 13:31
  • I think that you can comment now. :) Great answer. – Hovercraft Full Of Eels Oct 14 '11 at 21:18
  • A nice alternative to Trove is the Apache Mahout Collections. The purpose is the same, collections with primitive types. Which is better I do not know but would like to see them compared. – Rui Marques Mar 01 '14 at 13:59
25

Use the Trove library.

The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.

Erick Robertson
  • This is a good answer but I would have thought that the String storage overhead completely dwarfed the space saved in using ints instead of Integers. – Adamski Oct 13 '11 at 15:08
  • That's why in addition to using Trove (which is definitely the way to go, I +1'ed Erick's answer) you also do a String-to-long mapping with the postcode and use not a *TObjectIntHashMap{String}* but a *TLongIntHashMap*. See my answer for more details. – TacticalCoder Oct 13 '11 at 17:31
9

First off, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want such an approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or - in particular - a TreeMap would consume that much more memory; the first two would provide constant-time access, the TreeMap logarithmic access, and all provide a nicer interface than a plain list. I would try to measure how big the difference in memory consumption really is.

UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific to strings, such as tries (especially Patricia tries), which might reduce the storage space needed for the strings.
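
To make the trie idea concrete, here is a naive node-per-character sketch (my illustration, not a Patricia trie; each node carries its own HashMap, so it trades one kind of overhead for another and should be measured before relying on it):

```java
import java.util.HashMap;
import java.util.Map;

// Naive character trie mapping String keys to int values.
// Shared prefixes are stored once; each node holds a map of child characters.
class CharTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int value;
        boolean hasValue;
    }

    private final Node root = new Node();

    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
        node.hasValue = true;
    }

    // Returns null when the key is absent.
    Integer get(String key) {
        Node node = root;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        return (node != null && node.hasValue) ? node.value : null;
    }
}
```

For nearly-sequential keys like UK postcodes, long shared prefixes collapse into shared paths, which is where the memory saving comes from; a Patricia trie additionally merges single-child chains into one node.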

Janick Bernet
  • From Collections.binarySearch() docs: `If the specified list does not implement the RandomAccess interface and is large, this method will do an iterator-based binary search that performs O(n) link traversals and O(log n) element comparisons.` – Qwerky Oct 13 '11 at 15:07
  • Ok, so that would indicate `O(n)` access to me then, which is pretty slow for 2 million items. But the OP changed it to ArrayList in the meantime anyway :) – Janick Bernet Oct 13 '11 at 15:12
  • After reading this (and other) great answers, I'm going with the http://code.google.com/p/patricia-trie/ Patricia Trie implementation. – skolima Oct 13 '11 at 15:44
7

What you are looking for is a succinct trie - a trie which stores its data in nearly the smallest amount of space theoretically possible.

Unfortunately, there are no succinct-trie libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).

In the meantime, if you don't mind JNI, there are several good native succinct-trie libraries you could call into.

BlueRaja - Danny Pflughoeft
5

Have you looked at tries? I've not used them, but they may fit what you're doing.

pillingworth
4

A custom tree would have the same O(log n) complexity; don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList, because a linked list allocates one extra object per stored value, which amounts to a lot of objects in your case.

solendil
4

As Erick writes, using the Trove library is a good place to start, as you save space by storing int primitives rather than Integers.

However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:

  • If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
  • If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
  • You could consider transforming each String into a UTF-8 encoded byte array and wrapping this in a class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
  • You could write the map to a file and then memory map a portion or all of the file.
  • You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.
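
The UTF-8 idea from the third bullet might look roughly like this (my sketch of the hypothetical MyString wrapper from the answer; equals and hashCode must be overridden or map lookups will silently fail, since arrays compare by identity):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Wraps a UTF-8 byte array so it can serve as a map key.
// For mostly-ASCII keys this halves the character storage versus String.
final class MyString {
    private final byte[] bytes;

    MyString(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public boolean equals(Object o) {
        // Content-based equality, required for hash map lookups.
        return o instanceof MyString && Arrays.equals(bytes, ((MyString) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```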
Adamski
  • Those are UK postcodes, so there's a huge overlap between their values - they are practically sequential. – skolima Oct 13 '11 at 15:17
  • That is awesome. I will post another answer with suggested solution. – Adamski Oct 13 '11 at 15:21
  • @Adamski: ah ah, you wrote your answer while I was writing mine... I simply suggest a mapping "UK postcode -> long" and hence he can use Trove and a TLongIntHashMap fully backed by primitives and primitives only : ) – TacticalCoder Oct 13 '11 at 15:26
4

Maybe you can go with a radix tree?

hage
2

Use java.util.TreeMap instead of java.util.HashMap. It uses a red-black binary search tree and doesn't use more memory than what is required for holding the nodes containing the elements in the map. There are no extra buckets, unlike with HashMap or Hashtable.
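
Usage is a drop-in change from HashMap (a minimal sketch; note that TreeMap still allocates one node object per entry and boxes the int values, so the saving over HashMap is mainly the absent bucket array):

```java
import java.util.TreeMap;

TreeMap<String, Integer> map = new TreeMap<>();
map.put("AB1 2CD", 42);           // O(log n) insert into the red-black tree
int value = map.get("AB1 2CD");   // O(log n) lookup, same Map interface
```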

Jagat
2

I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.

Jordan Bentley
  • I can't believe no one else has suggested that a dataset that large belongs in a database. – Vala Oct 20 '11 at 11:05
  • @Thor84no In college I ported a data-mining algorithm from Java into a single SQL query. It was admittedly hard to read/maintain, but it ran >100 times faster. – Jordan Bentley Oct 20 '11 at 14:50
1

I'd consider using some cache, as these often have overflow-to-disk ability.

Rostislav Matl
0

The problem is the objects' memory overhead, but with some tricks you can try to implement your own hash set. As others have said, Strings have quite a large overhead, so you need to "compress" them somehow. Also, try not to use too many arrays (lists) inside the hashtable (if you do a chaining-type hashtable), as they are also objects with their own overhead. Better yet, build an open-addressing hashtable.
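
A bare-bones open-addressing table over primitive arrays might look like this (my sketch, not an existing library: keys are assumed to be pre-encoded to non-zero longs, the capacity is a fixed power of two sized above the entry count, and there is no resizing or removal):

```java
// Open-addressing (linear probing) hash table from long keys to int values.
// Backed by primitive arrays: no per-entry objects, no chaining lists.
class LongIntTable {
    private final long[] keys;   // 0 marks an empty slot, so 0 is not a valid key
    private final int[] values;
    private final int mask;

    LongIntTable(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    private int slot(long key) {
        // Fibonacci-style mixing spreads sequential keys across slots.
        return (int) (key * 0x9E3779B97F4A7C15L >>> 40) & mask;
    }

    void put(long key, int value) {
        int i = slot(key);
        while (keys[i] != 0 && keys[i] != key) {
            i = (i + 1) & mask; // linear probe to the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    int get(long key, int missing) {
        int i = slot(key);
        while (keys[i] != 0) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return missing;
    }
}
```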

nesvarbu
0

You might create a key class that matches your needs. Perhaps like this:

public class MyKey implements Comparable<MyKey>
{
    char[] keyValue; // always 7 characters

    public MyKey(String keyValue)
    {
        ... load this.keyValue from the String keyValue.
    }

    public int compareTo(MyKey rhs)
    {
        ... blah
    }

    public boolean equals(Object rhs)
    {
        ... blah
    }

    public int hashCode()
    {
        ... blah
    }
}
DwB
0

Try this one:

OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for(int i = 0; i < 2000000; i++)
{
  myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));

import java.util.Collection;
import java.util.HashMap;

public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
    @Override
    public boolean containsValue(Object value) {
        if (value != null)
        {
            Class<? extends Object> aClass = value.getClass();
            if (aClass.isArray())
            {
                Collection<V> values = this.values();
                for (Object val : values)
                {
                    int[] newval = (int[]) val;
                    int[] newvalue = (int[]) value;
                    if (newval[0] == newvalue[0])
                    {
                        return true;
                    }
                }
            }
        }
        return false;
    }
}
Delta
0

Actually, HashMap and List are too general for such a specific task as looking up an int by zipcode. You should take advantage of knowledge about the data in use. One option is a prefix tree with leaves that store the int value. Also, it could be pruned if (my guess) a lot of codes with the same prefix map to the same integer.

Lookup of the int by zipcode in such a tree is linear in the key length and does not grow as the number of codes increases, compared to O(log N) for binary search.

asamolov
0

Since you are intending to use hashing, you can try numerical conversion of the strings based on ASCII values. The simplest idea would be:

    int sum = 0;
    for (int i = 0; i < arr.length; i++) {
        sum += (int) arr[i];
    }

hash "sum" using a well defined hash functions. You would use a hash function based on the expected input patterns. e.g. if you use division method

    public int hasher(int sum){
       return sum%(a prime number);
    }

Selecting a prime number that is not close to an exact power of two improves performance and gives a more uniform distribution of hashed keys.

Another method is to weight the characters based on their respective positions.

E.g., if you use the above method, both "abc" and "cab" will be hashed to the same location; but if you need them stored in two distinct locations, give weights to positions like we do in positional number systems.

     int sum = 0;
     int weight = 1;
     for (int i = 0; i < arr.length; i++) {
         sum += (int) arr[i] * weight;
         weight = weight * 2; // using powers of 2 gives better results (you know why :))
     }

As your sample is quite large, you'd be better off avoiding collisions with a chaining mechanism rather than a probe sequence. Ultimately, the method you choose depends entirely on the nature of your application.
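
The anagram collision mentioned above can be checked with a small sketch (my illustration: plain character sum versus positional weights):

```java
// Plain sum collides for anagrams; positional weights separate them.
int plainSum(String s) {
    int sum = 0;
    for (int i = 0; i < s.length(); i++) {
        sum += s.charAt(i);
    }
    return sum;
}

int weightedSum(String s) {
    int sum = 0;
    int weight = 1;
    for (int i = 0; i < s.length(); i++) {
        sum += s.charAt(i) * weight;
        weight *= 2; // weight by position, as in a base-2 number system
    }
    return sum;
}
```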

Tharindu Rusira