13

I have to store 111 million key-value pairs (one key can have multiple values, at most 2 or 3) whose keys are 50-bit integers and whose values are 32-bit (at most) integers. Now, my requirements are:

  1. Fast insertion of a (key, value) pair [allowing duplicates]
  2. Fast retrieval of the value(s) for a given key.

A nice solution based on MultiMap is given here. However, I want to store more key-value pairs in main memory with no or only a small performance penalty. I have read in web articles that a B+ tree, R+ tree, B-tree, compact multimap, etc. can be a nice solution for that. Can anybody help me:

Is there any Java library which properly satisfies all those needs (the data structures mentioned above, or others, are also acceptable; no issue with that)? Actually, I want an efficient Java library data structure for storing/retrieving key-value(s) pairs which has a small memory footprint, and it must be in-memory.

NB: I have tried HashMultimap (Guava, with some modification using Trove), as mentioned by Louis Wasserman, and Kyoto/Tokyo Cabinet, etc. My experience with disk-backed solutions is not good, so please avoid those :). Another important point for choosing the library/data structure: the keys are 50-bit (so if we assign 64 bits, 14 bits will be wasted), and the values are 32-bit ints at most - mostly they are 10-14 bits - so we can save space there as well.
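To make that saving concrete, here is a minimal sketch of the packing idea for the common case where the value fits in 14 bits (the class and method names are illustrative, not from any library; wider values would still need a separate int array):

public final class PackedKeyValue {

    private static final int VALUE_BITS = 14;
    private static final long VALUE_MASK = (1L << VALUE_BITS) - 1;

    // Packs a 50-bit key and a value of up to 14 bits into one 64-bit long
    // (50 + 14 = 64, so the pair fits exactly).
    static long pack(long key, int value) {
        return (key << VALUE_BITS) | (value & VALUE_MASK);
    }

    static long unpackKey(long packed) {
        return packed >>> VALUE_BITS; // unsigned shift: the key has no sign bit
    }

    static int unpackValue(long packed) {
        return (int) (packed & VALUE_MASK);
    }
}

Packed pairs like this could then be stored in a plain long[] rather than in a map of boxed objects.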

Arpssss
  • Most of the space will be lost to support fast insertion and removal. The more work your insert/removal does, the more compact you can be. It seems you should be able to store this in a few GB easily. What is your memory requirement? – Peter Lawrey Apr 08 '12 at 18:06
  • @PeterLawrey, I want to store all 110 million key-value pairs in 5-6 GB. – Arpssss Apr 08 '12 at 18:13
  • So if you have 12 bytes for the key and value, you can still have about 75% overhead. – Peter Lawrey Apr 08 '12 at 18:42
  • @PeterLawrey, True. But using 2.5 GB, I can only store 30 million (using Guava). That's why I am asking. – Arpssss Apr 08 '12 at 18:50

6 Answers

7

I don't think there's anything in the JDK which will do this.

However, implementing such a thing is a simple matter of programming. Here is an open-addressed hashtable with linear probing, with keys and values stored in parallel arrays:

public class LongIntParallelHashMultimap {

    private static final long NULL = 0L;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int[] get(long key) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int count = countHits(key, index);

        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) ++numHits;
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        // the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
        // see The Art of Computer Programming, section 6.4
        // the constant has two important properties:
        // (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
        // (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
        long hash = key * 5700357409661598721L;
        return Math.abs((int) (hash % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }

}

Note that this is a fixed-size structure. You will need to create it big enough to hold all your data - 110 million entries for me takes up 1.32 GB. The bigger you make it beyond what is needed to store the data, the faster insertions and lookups will be. I found that for 110 million entries with a load factor of 0.5 (2.64 GB, twice as much space as needed), it took on average 403 nanoseconds to look up a key, but with a load factor of 0.75 (1.76 GB, a third more space than needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference, and indeed, with a load factor of 0.33 (4.00 GB, three times as much space as needed), I get an average time of 394 nanoseconds. So, even though you have 5 GB available, don't use it all.

Note also that zero is not allowed as a key. If this is a problem, change the null value to something else, and pre-fill the keys array with it on creation.
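For reference, here is a minimal usage sketch of the class above (the sizing follows the load-factor discussion; the key and values are arbitrary):

// Size the table for 110 million entries at a load factor of 0.5.
int entries = 110 * 1000 * 1000;
int capacity = (int) (entries / 0.5);

LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(capacity);
map.put(42L, 7);
map.put(42L, 8);

int[] hits = map.get(42L); // {7, 8}, in probe order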

Tom Anderson
  • Thanks a lot. Can you kindly tell me how much space (in GB, more or less) it takes to store those 110 million entries? And you searched for every key, meaning 110 million keys in 17 seconds? – Arpssss Apr 08 '12 at 23:11
  • Cool. The space requirement is OK. But the search performance is not what I require. I want performance like 0.5 million/sec (for example), as with the Guava MultiMap. – Arpssss Apr 08 '12 at 23:16
  • I realised I had a bug in my test harness (not the code above). Now I've fixed it: the 110 million entries are for 80722005 different keys, so an average of 1.36 entries per key. Getting each key in turn took 33 seconds, or 300 nanoseconds per key. That's 3.3 million lookups per second. – Tom Anderson Apr 09 '12 at 00:18
  • Further correction: 403 nanoseconds per key. That's 2.5 million lookups per second. – Tom Anderson Apr 09 '12 at 00:35
  • OMG. I am trying it now. Thanks a lot. What processor speed are you using? Because it is faster than Guava as well. – Arpssss Apr 09 '12 at 00:45
  • My machine has a 2.2 GHz i7, and a [Geekbench score of 9942](http://www.primatelabs.ca/geekbench/mac-benchmarks/#64bit). It's quite powerful for a laptop, but not particularly powerful for a desktop. – Tom Anderson Apr 09 '12 at 01:04
  • This code performs nicely. However, it has two issues: 1) it gives its worst performance when every query matches, and 2) when more keys have duplicates. Check this code, http://textuploader.com/?p=6&id=hzGIp, which takes more than 10 minutes to finish just 5 million puts/gets. – Arpssss Apr 09 '12 at 02:23
  • Actually, my tests suggested that it is faster when there are more values per key. The problem in your test is that the keys are not well-distributed: in the 10 million element array, you end up with the first five million slots being full, and the second five million being empty, with the length of the chain that the linear probe has to walk increasing linearly down the array. The lookup for the last key, 2500000, has to iterate over a quarter of a million slots to find its values! – Tom Anderson Apr 09 '12 at 10:29
  • OK. So, if keys are well distributed and keys have multiple values, it will perform nicely. – Arpssss Apr 09 '12 at 10:34
  • If your input is really not well-distributed like this, then a solution is to add a stirring function to the key. I've edited the code to do this - I multiply the key by a carefully-chosen magic number. With that one change, your test completes in 0 seconds. – Tom Anderson Apr 09 '12 at 10:51
  • My feeling is that a B-tree will not perform better. It might. But it's much more complicated to write. – Tom Anderson Apr 09 '12 at 10:53
  • My stirring function is not a particularly smart one. I'm sure when @Peter Lawrey copies it he can improve it :). – Tom Anderson Apr 09 '12 at 10:58
  • I checked. It takes less time. But why did you choose that number? Any reason? I am really curious to know. – Arpssss Apr 09 '12 at 11:01
  • Another thing: all my keys are positive longs, so will the magic number/code change? – Arpssss Apr 09 '12 at 13:39
  • Thanks. Really, really great help. +100K for all the comments & suggestions. :) Thanks a lot. – Arpssss Apr 09 '12 at 13:41
  • I've added a comment on the magic number. I got the idea from http://www.concentric.net/~ttwang/tech/inthash.htm where you can see that there are much better, although more complicated, hash functions. The magic number should work fine for positive longs - it should work fine for all longs, of which positive ones are a subset. – Tom Anderson Apr 09 '12 at 14:41
  • @Arpssss: I have no idea what you mean by "save keys-values arrays in disk", nor why that might be a problem. I think you should post a new question about this. – Tom Anderson Apr 16 '12 at 15:01
  • Tom, sorry for asking like this, but it seems nobody can help with this issue. I hope you can help with the issues in the above code written by you. My questions are here: http://stackoverflow.com/questions/11384249/java-project-make-hash-table-including-load-store-performance-better and http://stackoverflow.com/questions/11398762/custom-hashmap-code-issue. I really need your help on those. – Arpssss Jul 09 '12 at 18:10
2

Is there any Java library which satisfies all those needs properly?

AFAIK no. Or at least, not one that minimizes the memory footprint.

However, it should be easy to write a custom map class that is specialized to these requirements.

Stephen C
  • Thanks for replying. But what do you think about building a custom map/tree-like structure? – Arpssss Apr 08 '12 at 16:53
  • Or do you have an idea about any library that solves the above purpose with a (comparatively) smaller memory footprint, so I can at least try to get a benchmark? – Arpssss Apr 08 '12 at 16:57
  • @Arpssss - I'm suggesting that in a case like this you could write your own map class ... from scratch. – Stephen C Apr 09 '12 at 00:23
2

It's a good idea to look at databases, because problems like these are what they are designed for. In recent years key-value databases have become very popular, e.g. for web services (keyword: "NoSQL"), so you should find something.

The choice of a custom data structure also depends on whether you want to use a hard drive to store your data (and how safe that has to be) or whether it is completely lost on program exit.

If you implement it manually and the whole DB fits into memory fairly easily, I'd just implement a hashmap in C. Create a hash function that gives a (well-spread) memory address from a key. Insert there, or next to it if that slot is already assigned. Insertion and retrieval are then O(1). If you implement it in Java, you'll pay the per-object overhead for each boxed (wrapper) value.

j13r
  • Thanks. Actually, the main issues with that are: 1) they are very, very slow because my disk I/O is slow; 2) they have lots of tuning parameters, and it is not possible for me to tune every DB; 3) they can't take advantage of memory, as I saw with Tokyo Cabinet (maybe it works well after lots of tuning, but a different machine means different tuning parameters). So, I want an in-memory implementation; load/store performance will not be an issue. – Arpssss Apr 08 '12 at 17:09
  • 1) There are pure in-memory DBs such as MemcacheDB, Redis, and the things suggested here: http://stackoverflow.com/questions/3122652/c-in-memory-key-value-stores http://stackoverflow.com/questions/2574689/looking-for-a-lightweight-java-compatible-in-memory-key-value-store 2) I understand that complexity can drive you to a custom solution. – j13r Apr 08 '12 at 17:18
  • here are more DB links: http://nosql-database.org/ (scroll down to Key Value / Tuple Store) – j13r Apr 08 '12 at 17:23
  • Thanks a lot. Does C have any built-in multimap that has no such memory overhead? I know a little bit of C. – Arpssss Apr 08 '12 at 18:24
2

Based on @Tom Anderson's solution, I removed the need to allocate objects and added a performance test.

import java.util.Arrays;
import java.util.Random;

public class LongIntParallelHashMultimap {
    private static final long NULL = Long.MIN_VALUE;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, NULL);
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int get(long key, int[] hits) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);

        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
                if (hitIndex == hits.length)
                    break;
            }
            index = successor(index);
        }

        return hitIndex;
    }

    private int indexFor(long key) {
        return Math.abs((int) (key % keys.length));
    }

    private int successor(int index) {
        index++;
        return index >= keys.length ? index - keys.length : index;
    }

    public int size() {
        return size;
    }

    public static class PerfTest {
        public static void main(String... args) {
            int values = 110 * 1000 * 1000;
            long start0 = System.nanoTime();
            long[] keysValues = generateKeys(values);

            LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
            long start = System.nanoTime();
            addKeyValues(values, keysValues, map);
            long mid = System.nanoTime();
            int sum = lookUpKeyValues(values, keysValues, map);
            long time = System.nanoTime();
            System.out.printf("Generated %.1f M keys/s, Added %.1f M/s and looked up %.1f M/s%n",
                    values * 1e3 / (start - start0), values * 1e3 / (mid - start), values * 1e3 / (time - mid));
            System.out.println("Expected " + values + " got " + sum);
        }

        private static long[] generateKeys(int values) {
            Random rand = new Random();
            long[] keysValues = new long[values];
            for (int i = 0; i < values; i++)
                keysValues[i] = rand.nextLong();
            return keysValues;
        }

        private static void addKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            for (int i = 0; i < values; i++) {
                map.put(keysValues[i], i);
            }
            assert map.size() == values;
        }

        private static int lookUpKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            int[] found = new int[8];
            int sum = 0;
            for (int i = 0; i < values; i++) {
                sum += map.get(keysValues[i], found);
            }
            return sum;
        }
    }
}

prints

Generated 34.8 M keys/s, Added 11.1 M/s and looked up 7.6 M/s

Run on a 3.8 GHz i7 with Java 7 update 3.

This is much slower than the previous test because you are accessing main memory at random, rather than the cache. This is really a test of the speed of your memory. The writes are faster because they can be performed asynchronously to main memory.


Using this collection

import java.util.Collection;
import java.util.Set;

import com.google.common.base.Supplier;
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;

// Trove 3 package layout assumed
import gnu.trove.TDecorators;
import gnu.trove.map.hash.TLongObjectHashMap;
import gnu.trove.set.hash.TIntHashSet;

final SetMultimap<Long, Integer> map = Multimaps.newSetMultimap(
        TDecorators.wrap(new TLongObjectHashMap<Collection<Integer>>()),
        new Supplier<Set<Integer>>() {
            public Set<Integer> get() {
                return TDecorators.wrap(new TIntHashSet());
            }
        });
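
For reference, a put and a get against this multimap look like this (a sketch; the key and values are arbitrary):

// Illustrative use of the SetMultimap defined above.
map.put(123456789L, 42);
map.put(123456789L, 43);

Set<Integer> hits = map.get(123456789L); // [42, 43]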

Running the same test with 50 million entries (which used about 16 GB) and -mx20g, I got the following result.

 Generated 47.2 M keys/s, Added 0.5 M/s and looked up 0.7 M/s

For 110 M entries you will need about 35 GB of memory and a machine 10 x faster than mine (3.8 GHz) to perform 5 million adds per second.

Peter Lawrey
  • Sorry, I do not get the code. Another thing: Tom's solution has issues: 1) it gives its worst performance when every query matches, and 2) when more keys have duplicates. Check this code, textuploader.com/?p=6&id=hzGIp, which takes more than 10 minutes to finish just 5 million puts/gets. – Arpssss Apr 09 '12 at 07:50
  • Does this code solve those issues? Another point: I want the result returned as a set of values. – Arpssss Apr 09 '12 at 07:59
  • Creating the Set is likely to be more expensive than the lookup, but you can do that. – Peter Lawrey Apr 09 '12 at 08:23
  • You are pushing the limits of your memory architecture; I would make sure you have the fastest memory you can afford for this type of solution. However, I suspect you will have large bottlenecks elsewhere in your application which will make this portion of code unimportant. E.g. do you do any network IO at any point? ;) – Peter Lawrey Apr 09 '12 at 08:25
  • The cost of more keys matching is very small as the main cost is filling the cache line from main memory. Having 8 return values could be much the same as having none. You have to think in terms of how you are accessing main memory not lines of code. – Peter Lawrey Apr 09 '12 at 08:27
  • I have 8 GB RAM and a 2.1 GHz processor, and I am running on that. Another point: this code inserts the whole key set first and then does the lookups. I have to insert (key, value) pairs one at a time and also access the value set using get(key), as Tom does. I found Tom's solution quite nice for inserting/accessing when there are no duplicates/matches. I uploaded that test code here: http://textuploader.com/?p=6&id=hzGIp. Do you find his code nice? Or your proposed code (without the different insertion order)? – Arpssss Apr 09 '12 at 08:42
  • It's not going to be nice in any case, because you have such a high performance requirement. You are not going to be able to create a Set for each key, because that alone will take about 40 bytes per key. Unless you have fast memory, you are going to be at a disadvantage. BTW: `b = (long) Math.floor(i/2) + 1` is the same as `b = i/2 + 1` – Peter Lawrey Apr 09 '12 at 08:52
  • I would suggest something like an i7 overclocked to 4.5 GHz with 32 GB of 1600 MHz memory (costs less than $1000 for the CPU, MB and memory). I hope you can see that without the right hardware your solution is going to be half the speed or worse. – Peter Lawrey Apr 09 '12 at 08:54
  • Hmm. But for now I think I'll do some further tests with Louis Wasserman's solution: http://stackoverflow.com/questions/10056135/java-hash-multi-map-key-with-multiple-values-implementation. – Arpssss Apr 09 '12 at 09:03
  • I have tried this collection for the same test and added my results to the end. – Peter Lawrey Apr 09 '12 at 09:30
  • OMG. Thanks for saving my time. BTW, can you kindly modify your code to have the same functionality as Tom's? I want to check the performance of your code as well, using this: textuploader.com/?p=6&id=hzGIp. – Arpssss Apr 09 '12 at 09:36
  • The main performance difference is re-using the array instead of creating a new one each time, so no, the APIs cannot be the same. You need to change the caller to provide an array to populate (one you reuse). – Peter Lawrey Apr 09 '12 at 09:42
  • Is there no other-language (C, C++) multimap library that solves this issue? C should, because it is not OOP. Do you know of any built-in C (or other) library that solves it? – Arpssss Apr 09 '12 at 09:46
  • There are plenty, I imagine. However, your problem is that none will provide a Java Set interface (which is the slowest part of your requirement), nor will they speed up your CPU-to-main-memory bridge, which is the second slowest. What they will have is tested implementations of removeKey(), which is not in the sample code, but you could add it. – Peter Lawrey Apr 09 '12 at 09:48
  • OK. Can you give a link to a nice one (small memory footprint and fast)? Another thing: are they able to solve it in a smaller memory footprint, say 4-5 GB (I can change the server to solve other issues)? – Arpssss Apr 09 '12 at 09:53
  • @Arpssss: Tom's example and mine use ~2.6 GB, so I don't see why they couldn't use less than 4-5 GB. I don't know of a similar one in C++, but I have no doubt they exist. With the overhead of JNI, I don't imagine they would be faster or simpler to use. – Peter Lawrey Apr 09 '12 at 09:58
  • Sorry, I forgot another point. Is there any B+ tree, R+ tree, R-tree, etc. based solution for it? Do you know of one? – Arpssss Apr 09 '12 at 10:14
  • You can try http://code.google.com/p/leveldb/. The benefit of using a tree is that the keys are sorted. The disadvantage is that each lookup requires examining O(ln N) keys instead of O(1), so it is usually several times slower. – Peter Lawrey Apr 09 '12 at 10:38
0

If you must use Java, then implement your own hashtable/hashmap. An important property of your table is to use a linked list to handle collisions (separate chaining). Hence, when you do a lookup, you can return all the elements on the list for that key.
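
A minimal sketch of that approach for the primitive long/int case in the question (the class and its names are mine; a real version would also need resizing and, to meet the memory budget, a more compact node layout):

import java.util.ArrayList;
import java.util.List;

public class ChainedLongIntMultimap {

    // One node per (key, value) pair; collisions and duplicate keys
    // simply extend the bucket's chain.
    private static final class Node {
        final long key;
        final int value;
        final Node next;

        Node(long key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] buckets;

    public ChainedLongIntMultimap(int bucketCount) {
        buckets = new Node[bucketCount];
    }

    private int indexFor(long key) {
        return (int) Math.abs(key % buckets.length);
    }

    public void put(long key, int value) {
        int i = indexFor(key);
        buckets[i] = new Node(key, value, buckets[i]); // prepend to the chain
    }

    // Returns all values stored under the key, as the answer describes.
    public List<Integer> get(long key) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key == key) hits.add(n.value);
        }
        return hits;
    }
}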

kasavbere
0

I might be late in answering this question, but Elasticsearch will solve your problem.

shaILU