3

I am currently writing some code in java meant to be a little framework for a project which revolves around a database with some billions of entries. I want to keep it high-level and the data retriueved from the database shoud be easily usable for statistic inference. I am resolved to use the Map interface in this project.

a core concept is mapping the attributes ("columns in the database") to values ("cells") when handling single datasets (with which I mean a columns in the database) for readable code: I use enum objects (named "Attribute") for the attribute types, which means mapping <Attribute, String>, because the data elements are all String (also not very large, maximum 40 characters or so). There are 15 columns, so there are 15 enums, and the maps will have only so much entries, or less.

So it appears, I will be having a very large number of Map objects floating around, at times, but with comparatively little payload (15-). My goal is to not make the memory explode due to the implementation memory overhead, compared to the actual payload. (Stretch goal: do the same with cpu usage ;] )

I was not really familiar with all the different implementations of Java Collections to date, and when the problem dawned at me today, I looked into my to-date all-time favorite 'HashMap', and was not happy with how much memory overhead there was declared. I am sure, that additonal to the standard implementations, there are a number of implementations not shipped with Java. Googling my case brought not up much of a result, So I am asking you:

Do you know a good implementation of Map for my use case (low entry count, low value size, enumerable keys, ...)

I hope I made my use case clear, and am anxious for your input =) Thanks a lot!


Stretch answer goal, absolutely optional and only if you got the time and knowledge: What other implementations of collections are suitable for:

  • handling attribute (the String things) vectors, and matrices for inference data (counts/probabilities) (Matrices: here I am really clueless for now, Did really no serious math work with java to date)
  • math libraries for statistical inference, see above
simlei
  • 155
  • 2
  • 12
  • I don't think this answers your question directly, but it might be worth your time taking a look at the Google Guava collections framework https://code.google.com/p/guava-libraries/ as they have some really nice Map implementations. – David May 17 '13 at 12:59
  • take a look at memcached http://www.javaworld.com/javaworld/jw-04-2012/120418-memcached-for-java-enterprise-performance.html – fmodos May 17 '13 at 13:00
  • 1
    You want to keep billions of entries loaded into memory at all times? I don't think you actually want to use any java Map implementations. You most likely want a map backed by some sort of storage system that stores values outside your current VMs memory. This most likely will involve not using a java Map at all. – Deadron May 17 '13 at 17:24

3 Answers3

7

Use EnumMap, this is the best map implementation if you have enums as key, for both performance and memory usage.

The trick is that this map implementation is the only one that that does not store the keys, it only needs a single array with the values (similar to an ArrayList of the values). There is only a little bit of overhead if there are keys that are not mapped to a value, but in most cases this won't be a problem because enums usually do not have too many instances.

Compared to HashMap, you additionally get a predictable iteration order for free.

Philipp Wendler
  • 11,184
  • 7
  • 52
  • 87
5

Since you start off saying you want to store lots of data, eventually, you'll also want to access/modify that data. There are many high performance libraries out there.

Look at

When you find a bottleneck, you can switch to using a lower level API (more efficient)

You'll many more choices if look a bit more: What is the most efficient Java Collections library?

EDIT: if your strings are not unique, you could save significant amounts of memory using String.intern() : Is it good practice to use java.lang.String.intern()?

Community
  • 1
  • 1
Tom Carchrae
  • 6,398
  • 2
  • 37
  • 36
  • Thanks for this selection of libs, I will look into them. Fastutil was also mentioned above, looks really good =) – simlei May 17 '13 at 14:05
3

You can squeeze out a bit of memory with a simple map implementation that uses two array lists (keys and values). For larger maps, that is going to mean insertion and look up speeds become much slower because you have to scan the entire list. However, for small maps it is actually faster this way since you don't have to calculate any hashcodes and only have to look at a small number of entries.

If you need an implementation, take a look at my SimpleMap in my jsonj project: https://github.com/jillesvangurp/jsonj/blob/master/src/main/java/com/github/jsonj/SimpleMap.java

Jilles van Gurp
  • 7,927
  • 4
  • 38
  • 46
  • Own implementation - I like the idea. Did not come to my mind before, to be honest, but I guess that would be the option with the most control over performance. – simlei May 17 '13 at 13:11
  • This one definitely uses a lot less memory. If you want something more mainstream, Trove or Guava might have some OK implementations. BTW. using object serialization gives you a good idea how much memory you are saving. Simply serialize to a bytearrayoutputstream and count the number of bytes. – Jilles van Gurp May 17 '13 at 13:16
  • [Fastutil](http://fastutil.di.unimi.it/) has an implementation of this concept, including specific versions for primitive types to avoid boxing/unboxing. Look up the appropriate TypeA2TypeBArrayMap. – Ian Roberts May 17 '13 at 13:20
  • Fastutil indeed looks promising. Thanks for the hint to that! – simlei May 17 '13 at 14:04
  • I thank you all for answering; I will use EnumMap, since that seems to be a class which was written for my use case, and it does not require linking to an external library. – simlei May 18 '13 at 12:12