7

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file on the order of 100 MB. The first solution that I came up with is probably the simplest possible: at initialization I read in the whole file and create a large HashMap which will later be used to look strings up.
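A minimal sketch of that first, load-everything approach (the file format assumed here, one word and its number per line separated by a tab, is a guess; adjust the parsing to whatever your file actually contains):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DictionaryLoader {
    /** Reads the whole dictionary file into a HashMap at initialization. */
    static Map<String, Integer> load(String path) throws IOException {
        Map<String, Integer> dict = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // assumed format: word<TAB>id
                String[] parts = line.split("\t");
                dict.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
        return dict;
    }
}
```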

Is there an efficient way to do this without having to read the whole file at initialization? Perhaps not, but what if the file is really large, say on the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.

Thanks for the answers so far; they've made me realise I could be more specific in my question. As you've probably guessed, the application has to do with text mining, in particular representing text in the form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary and obtain their keys as fast as possible. The initial overhead of 'reading' the dictionary file or indexing it into a database is not as important, as long as the string look-up time is optimized. Again, let's assume that the dictionary is big, comparable to the amount of RAM available.

tsotsi
  • you can read the file in chunks of a certain byte size, store them in a `HashMap` object, and then store this object as a bytestream object on your hard drive. Repeat until you have read the whole file. – Mohammad Najar Sep 29 '14 at 20:17
  • @Mohammad That doesn't really solve the use case where the input objects are larger than available memory. At the end of the day, you'll end up with a `HashMap` containing objects that are too many to fit. – Santa Sep 29 '14 at 20:18
  • 3
    [Memory-mapped files](https://en.wikipedia.org/wiki/Memory-mapped_file) may be helpful. Have a look at [FileChannel.map()](http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#map%28java.nio.channels.FileChannel.MapMode,%20long,%20long%29) and you can get a `FileChannel` from a `RandomAccessFile` (see the sketch after these comments). – Michał Kosmulski Sep 29 '14 at 20:19
  • ahh I see what you mean. The data retrieval is difficult in this case. – Mohammad Najar Sep 29 '14 at 20:19
  • 3
    A Trie (https://en.wikipedia.org/wiki/Trie) would be a (space-)efficient data structure for this use case. – Philipp Wendler Sep 29 '14 at 20:29
  • @PhilippWendler Sure, tries are compact and way faster than any DB can ever be. I guess, an immutable trie could be easily run using memory mapped files, so there'd be hardly any start-up overhead. – maaartinus Sep 30 '14 at 00:45
  • Put your strings in a single gigantic String (or several smaller, if the JVM has an object size limit) and build a "dope vector" of some sort to map between numbers and offsets in the gigantic String. Or do the same only leave the String data in the file and use random access to get the bytes. – Hot Licks Sep 30 '14 at 01:07
  • For some text mining applications an alternative to a giant dictionary could be to use a hash function to map the words to integers. See https://en.wikipedia.org/wiki/Feature_hashing – florian Mar 16 '16 at 19:12
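A minimal sketch of the memory-mapped approach suggested in the comments (the file name is a placeholder, and how you decode entries from the mapped buffer depends entirely on your dictionary's binary layout):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedDictionary {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("dictionary.bin", "r");
             FileChannel channel = file.getChannel()) {
            // Maps the file into virtual memory; pages are loaded lazily by the OS.
            // Note: a single MappedByteBuffer is limited to Integer.MAX_VALUE bytes (~2 GB).
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Reading from the buffer only touches the pages you actually access.
            if (buffer.capacity() > 0) {
                System.out.println("first byte: " + buffer.get(0));
            }
        }
    }
}
```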

4 Answers

4

Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a super-lightweight NoSQL key-value store.

What it offers for your task out of the box:

  • Persistence to disk via memory-mapped files (see the comment by Michał Kosmulski)
  • Lazy loading (disk pages are loaded only on demand) -> fast startup
  • If your data volume is larger than available memory, the operating system will unmap rarely used pages automatically.
  • Several JVMs can use the same map, because off-heap memory is shared at the OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
  • Strings are stored in UTF-8 form -> ~50% memory savings if the strings are mostly ASCII (as maaartinus noted)
  • int or long values take just 4 (or 8) bytes, as if you had a primitive-specialized map implementation.
  • Very little per-entry memory overhead, much less than in the standard HashMap and ConcurrentHashMap
  • Good configurable concurrency via lock striping, if you already need it, or are going to parallelize the text processing in the future.
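For illustration, a minimal sketch of creating and querying such a map (this uses the builder API as documented for recent Chronicle Map versions, which differs slightly from the 2.x API current when this answer was written; the `entries` bound and the `averageKey` sample are assumptions you would adapt to your data):

```java
import java.io.File;
import java.io.IOException;
import net.openhft.chronicle.map.ChronicleMap;

public class ChronicleDictionary {
    public static void main(String[] args) throws IOException {
        File store = new File("dictionary.dat");
        // Persisted, memory-mapped, off-heap map; reopening the same file later
        // skips the expensive initial load.
        try (ChronicleMap<CharSequence, Integer> dict = ChronicleMap
                .of(CharSequence.class, Integer.class)
                .averageKey("representative-word") // assumed typical key, used for sizing
                .entries(10_000_000)               // assumed upper bound on entry count
                .createPersistedTo(store)) {
            dict.put("example", 683);
            Integer id = dict.get("example");
            System.out.println(id);
        }
    }
}
```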
leventov
  • WOW. +1 from me! This is very cool; personally, this seems like exactly what he's looking for if he doesn't want to commit to a db. I'm curious what the drawbacks of such "off-heap" implementations are; this is the first I've ever heard of them. Thank you for teaching me something truly amazing! :0) – Devarsh Desai Sep 30 '14 at 00:56
  • @DevarshDesai what prevents ChronicleMap from becoming a genuine silver bullet is that keys/values must be marshalled/demarshalled on each query. For `String`s, unfortunately, this imposes the most severe overhead, because UTF-8 <-> `String` conversion is quite complicated. However, if string data can be stored as `byte[]`, the overhead is much lower, and for primitive keys/values there is almost no overhead. If keys/values are quite simple data objects of several primitive (or other data object) fields, ser/deser overhead can also be avoided by implementing the `Byteable` interface. – leventov Sep 30 '14 at 01:10
  • It's probably going to take me the rest of the day to digest what you just said, but once again- thank YOU! this is very very cool, and I'm not surprised one bit that this came up out of a HFT repository (this was my guess just before clicking on it); I'll definitely try to contribute to OpenHFT once I allocate some time to it! Hope you have a great week ahead! – Devarsh Desai Sep 30 '14 at 01:17
  • +1 Pretty cool thing. I'd probably still go for a trie as I have a clear idea how to do it, but `ChronicleMap` is surely a wonderful thing. @DevarshDesai In case it doesn't fit into memory, I'd go for two maps, where the smaller one would contain more frequent words. The reason is that with each word paged in, the OS brings in many others, which get rarely needed. – maaartinus Sep 30 '14 at 01:33
  • 1
    @maaartinus I agree about trie, the downside is that it should be implemented, unlike hash tables. I'm not aware about good open-source implementations, particularly off-heap. (Please point them out in your answer if you know some.) Hm, there is an idea to start `ChronicleTrie` :) – leventov Sep 30 '14 at 01:41
  • No idea about an existing implementation and I'm afraid, there are too many choices w.r.t. values (Set or Map), mutability (none, mutable values only, mutable everything), compression (use char, UTF8, pack even harder like I wrote in my answer). – maaartinus Sep 30 '14 at 02:46
  • One advantage a hash map has over a trie is O(1) cache misses or page misses, whereas a trie is O(log N). – Peter Lawrey Sep 30 '14 at 06:44
  • While StringBuilder to UTF8 is expensive for very small maps, it is not as expensive as one or two cache misses. You won't see much impact for short keys. – Peter Lawrey Sep 30 '14 at 06:46
  • @PeterLawrey well practically `O(log [typical word size])` is more `O(1)` than hash table's `O(1)` :) – leventov Sep 30 '14 at 06:53
  • @leventov this assumes the index fits into cache, ideally L1 cache. For a large structure, most of the steps of trie lookups will be cache misses which I think is part of the reason they are so much slower. – Peter Lawrey Sep 30 '14 at 07:12
  • @leventov: Thanks for this! I do appreciate the concurrency options which might come handy in the future. For a basic use though, on a single JVM, would you still say that it is a good solution in terms of how long one string look-up takes? – tsotsi Sep 30 '14 at 08:20
  • 2
    @PeterLawrey I believe, if you organize the trie properly, you'll never get more than one page miss: The part near the root will stay in memory and you'll need no more than one page for the tail. IMHO a trie could be useful when the memory is tight; the trie would still fit while the `HashMap` would not (as it's 3-10 times bigger? just a guess). And then you'd gladly trade a page miss for a few cache misses. – maaartinus Sep 30 '14 at 12:47
  • 1
    @tsotsi A string look up for a modest key size takes about 200 to 300 ns. It can jump to 5 micro-seconds 1% of the time, usually due to a TLB cache miss. – Peter Lawrey Sep 30 '14 at 13:19
2

Once your data structure is anywhere from a few hundred MB up to the order of your RAM, you're better off not initializing it at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure the fastest retrieval of text once your file gets that large and you're running up against the -Xmx settings of your JVM. This is because if your file is as large as, or much larger than, your maximum heap size setting, you're inevitably going to crash your JVM.

As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB at the beginning. You can parallelize this process if there are other parts of your code execution that don't depend on it.
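As a rough sketch of the indexed-database approach (embedded H2 is used here purely as an example; the table layout and database name are made up, and any indexed relational or NoSQL store would do):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DictionaryDb {
    public static void main(String[] args) throws SQLException {
        // Embedded H2 database stored next to the application (example only)
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./dictionary")) {
            try (Statement st = conn.createStatement()) {
                // PRIMARY KEY gives an index, so look-ups do not scan the table.
                st.execute("CREATE TABLE IF NOT EXISTS dict(word VARCHAR PRIMARY KEY, id INT)");
            }
            try (PreparedStatement insert =
                         conn.prepareStatement("MERGE INTO dict(word, id) VALUES (?, ?)")) {
                insert.setString(1, "example");
                insert.setInt(2, 683);
                insert.executeUpdate();
            }
            try (PreparedStatement lookup =
                         conn.prepareStatement("SELECT id FROM dict WHERE word = ?")) {
                lookup.setString(1, "example");
                try (ResultSet rs = lookup.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getInt(1));
                    }
                }
            }
        }
    }
}
```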

Please let me know if you have any questions!

Devarsh Desai
  • Whatever you do, any database is dead slow when compared to a `HashMap`. Given that the OP speaks of 100 MB, it makes no sense at all. What's worse: if the strings don't fit in memory, then you're at the mercy of your OS and hard disk... losing 5+ orders of magnitude in speed (100 ns HashMap vs. 10 ms hard disk). Compressing the strings using a trie sounds much faster. – maaartinus Sep 30 '14 at 00:41
  • 1
    In my experience, databases aren't actually that slow (with MongoDB I couldn't even tell the difference); in fact you can make them extremely quick if you're able to properly leverage the tools given by database internals. I've never seen a data structure that was 100 MB in size, and he also mentioned "orders of RAM"; at that point I would personally use a database, as others have suggested as well. I agree that having more things in-memory would make things much quicker, but I'm not operating under the assumption that the author of this post is going to go out and buy more memory for this problem. – Devarsh Desai Sep 30 '14 at 00:53
  • 1
    What do you mean by *that slow*? I've found [this benchmark](https://blog.serverdensity.com/mongodb-benchmarks)... the access times are 50 microseconds when no disk gets involved, which means 3 orders of magnitude slower. The times with an SSD are 10 milliseconds. *There's nothing in a DB that could be any faster than a HashMap, and there's quite some overhead.* The [IPC](http://en.wikipedia.org/wiki/Inter-process_communication) alone costs way more than the lookup itself. – maaartinus Sep 30 '14 at 01:19
  • 2
    I never disputed the fact that a HashMap would be quicker. All I'm saying is that 100 MB is a LOT of memory. From my intuition and best practices, I would never use a HashMap for that amount of memory. From my experience working with garbage collection, database internals, etc., I could not recommend allocating 100 MB+ on the heap for a single instance of a hashmap. Don't get me wrong though, I **completely respect your argument;** in fact I think leventov's ChronicleMap is the best solution here if he doesn't want to do a db. – Devarsh Desai Sep 30 '14 at 01:28
  • 1
    Then we do agree. I wouldn't be scared of the GC as the data seem to never change. I must once try what happens. Actually, the standard `HashMap` is a memory hog, and so is Guava's `ImmutableHashMap`, but there are many `CompactHashMap`s lying around. – maaartinus Sep 30 '14 at 01:38
  • 1
    haha, yes we agree :0) a `CompactHashMap`, `ChronicleMap` would be great solutions to this problem. Thank you for a great discussion, walked out learning a lot :) Hope you have a great day ahead! – Devarsh Desai Sep 30 '14 at 02:03
2

As stated in a comment, a Trie will save you a lot of memory.

You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.

At first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there's a way if you want to go low level.

> So what is critical for usage is to be able to look strings up in the dictionary and obtain their keys as fast as possible.

Then forget any database, as they're damn slow when compared to HashMaps.

If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).


I was asked to point to a good trie implementation, especially an off-heap one. I'm not aware of any.

Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.

I guess, the whole dictionary could be easily packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.

I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
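For illustration, here is a plain on-heap first-child/next-sibling trie over UTF-8 bytes; it only sketches the look-up logic and is nowhere near as compact as the packed single-ByteBuffer layout with relative base-128 child pointers described above:

```java
import java.nio.charset.StandardCharsets;

/** A simple first-child/next-sibling trie over UTF-8 bytes, mapping words to int ids. */
public class ByteTrie {
    private static final class Node {
        final byte label;
        Node firstChild;
        Node nextSibling;
        int value = -1; // -1 means "no word ends here"
        Node(byte label) { this.label = label; }
    }

    private final Node root = new Node((byte) 0);

    public void put(String word, int id) {
        Node node = root;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            Node child = node.firstChild;
            while (child != null && child.label != b) {
                child = child.nextSibling;
            }
            if (child == null) {
                child = new Node(b);
                child.nextSibling = node.firstChild;
                node.firstChild = child;
            }
            node = child;
        }
        node.value = id;
    }

    /** Returns the id assigned to the word, or -1 if it is not in the dictionary. */
    public int get(String word) {
        Node node = root;
        for (byte b : word.getBytes(StandardCharsets.UTF_8)) {
            Node child = node.firstChild;
            while (child != null && child.label != b) {
                child = child.nextSibling;
            }
            if (child == null) {
                return -1;
            }
            node = child;
        }
        return node.value;
    }

    public static void main(String[] args) {
        ByteTrie trie = new ByteTrie();
        trie.put("example", 683);
        System.out.println(trie.get("example")); // 683
        System.out.println(trie.get("missing")); // -1
    }
}
```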

maaartinus
0

It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or in a NoSQL solution like Solr (small learning curve, very fast).

Although NoSQL is very fast, if you really want to tweak performance and there are entries that are looked up far more frequently than others, consider using a limited-size cache to hold the most recently used (say) 10,000 lookups.
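A small sketch of such a cache using a `LinkedHashMap` in access order (the 10,000 limit is the figure from above, and `lookupInDatabase` is a placeholder for whatever store you end up querying):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruDictionaryCache {
    private static final int MAX_ENTRIES = 10_000;

    // An access-order LinkedHashMap evicts the least recently used entry once full.
    private final Map<String, Integer> cache =
            new LinkedHashMap<String, Integer>(MAX_ENTRIES, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    public Integer lookup(String word) {
        Integer id = cache.get(word);
        if (id == null) {
            id = lookupInDatabase(word); // placeholder for the SQL/Solr/NoSQL query
            if (id != null) {
                cache.put(word, id);
            }
        }
        return id;
    }

    // Hypothetical back-end lookup; replace with your actual store.
    private Integer lookupInDatabase(String word) {
        return null;
    }
}
```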

Bohemian