34

I want to store 100 million terms and their frequencies (from a text database) in a HashMap<String, Double>. It gives me an "Out of Memory" error. I tried increasing the heap space with -Xmx15000M, but it runs for half an hour and then throws the same exception again. The file from which I'm reading the words and frequencies is 1.7 GB.

Any help would be much appreciated.

Thanks :-)

rlb.usa
  • 14,942
  • 16
  • 80
  • 128
ablimit
  • 2,301
  • 6
  • 27
  • 41
  • Are you running a 32-bit or a 64-bit JVM? – nos Nov 02 '10 at 17:47
  • 38
    What on earth are you doing that requires 100 million terms? Are you working for Google? – DJClayworth Nov 02 '10 at 18:02
  • Why do you want to store it in HashMap in the first place? As many have suggested you can store in database, you may want to map reduce it (Hadoop?). Although it would entirely depend on why HashMap. – ch4nd4n Nov 02 '10 at 18:02
  • 1
    How many distinct terms are there? If there are many duplicates then it's possible that while the volume of data is too big for memory, the frequency table could still be a reasonable size. In that case, it's just a problem of processing the full file in stages.... – mikera Nov 02 '10 at 18:16
  • 1
    Duplicate of [Java HashMap performance optimization / alternative](http://stackoverflow.com/questions/1757363/java-hashmap-performance-optimization-alternative) Use a database. – BalusC Nov 02 '10 at 18:25
  • I'm also implementing another clustering algorithm which needs to know the frequencies to cluster these terms into groups. If I use a database, I think the database connection would become the bottleneck of the clustering algorithm; that's why I'm not using a relational database as my backend. I'm not using MapReduce because I don't have a cluster to run it on. – ablimit Nov 02 '10 at 21:29
  • So you have two alternatives: database, which runs slowly, or in-memory which doesn't work at all. Which one do you think you should choose? – DJClayworth Nov 03 '10 at 13:52
  • I ended up using Berkeley DB as a last resort. I point the Berkeley DB database directory at a region of memory (instead of using disk), which may provide faster look-ups. – ablimit Nov 03 '10 at 18:18
  • What do your terms look like? The word "term" implies something small, and natural language is rather limited here... so 100 million is a lot. Some examples, please. – ReneS Mar 16 '15 at 17:43
  • The default implementation of HashMap by Sun/Oracle is a node-based structure. That may be part of the reason why you run out of memory. If you create a different type of implementation without nodes, you can easily fit all this data into memory. One simple idea: Use a giant sorted array with binary search to find keys. – kevinarpe Oct 07 '15 at 01:46
  • I've worked with a similarly sized data set. The Pwned Passwords list has about 800 million entries, with a sha1 hash (40 hex characters), and a count for each entry. Working with a python dict (which uses a similar amount of storage to Java's HashMap), I was able to load a 100 million count subset of the data into about 19 GB of memory. This ran fine on a 32 GB laptop. – Brian Minton Dec 28 '21 at 15:04

15 Answers

18

For word processing like that, the answer is usually a tree (a trie, in particular) rather than a HashMap, if you can live with the longer lookup times. That structure is quite memory-efficient for natural languages, where many words share common prefixes.

Depending on the input, a Patricia tree might be even better.

(Also, if these really are words from a natural language, are you sure you actually need 100,000,000 entries? The number of commonly used words is surprisingly low; commercial solutions (word prediction, spelling correction) rarely use more than 100,000 words, regardless of language.)
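
A minimal sketch of the trie idea, counting term frequencies character by character (the class and its child-map representation are illustrative; a production version would use a more compact child encoding or a Patricia/radix tree):

import java.util.HashMap;
import java.util.Map;

// Frequency-counting trie: words sharing prefixes share nodes.
class FrequencyTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        double frequency;                       // 0.0 means "no complete term ends here"
    }

    private final Node root = new Node();

    void put(String term, double frequency) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
        }
        node.frequency = frequency;
    }

    Double get(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node.frequency;
    }
}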

Daniel Zolnai
  • 16,487
  • 7
  • 59
  • 71
Christoffer
  • 12,712
  • 7
  • 37
  • 53
  • I tried the Patricia trie. This time I'm hitting the GC limit and 15GB memory is still not enough. :-) – ablimit Nov 03 '10 at 03:40
  • 5
    It pointed me to another solution and I learned a new library tool. It's hard to pick the best one, whereas all the answers pointed to something useful. Best answer goes to one, but I'm very thankful to all the earnest answerers here. I wish I could choose more than one best answer... – ablimit Nov 03 '10 at 18:24
  • 1
    Slovak has 1,000,000 words, because we use inflections a lot – Oliv May 25 '17 at 08:17
11

Your problem is that 1.7 GB of raw text is more than 1500 MB even before the overhead added by the individual String objects. For huge mappings you should either use a database or a file-backed Map; those use disk space instead of heap.

Update

I don't think allocating 15 GB for the heap is possible with most JVMs. It won't work with any 32-bit JVM, and I originally didn't think a 64-bit JVM would manage it either; however, 15 GB of heap should work on a 64-bit JVM when enough RAM is available.
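
A hedged sketch of the file-backed approach, keeping the term table in an embedded on-disk database rather than on the heap (the sqlite-jdbc driver and the table layout are assumptions; any embedded store such as H2 or Berkeley DB works the same way):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Terms and frequencies live on disk; only the rows being touched occupy heap.
public class TermStore implements AutoCloseable {
    private final Connection conn;

    public TermStore(String path) throws Exception {
        conn = DriverManager.getConnection("jdbc:sqlite:" + path);
        conn.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS terms (term TEXT PRIMARY KEY, freq REAL)");
    }

    public void put(String term, double freq) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT OR REPLACE INTO terms VALUES (?, ?)")) {
            ps.setString(1, term);
            ps.setDouble(2, freq);
            ps.executeUpdate();
        }
    }

    public Double get(String term) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT freq FROM terms WHERE term = ?")) {
            ps.setString(1, term);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getDouble(1) : null;
            }
        }
    }

    @Override
    public void close() throws Exception { conn.close(); }
}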

josefx
  • 15,506
  • 6
  • 38
  • 63
  • @nos it would be up to 3.4 GB – josefx Nov 02 '10 at 17:50
  • You could put a database in-memory to speed that up, but yeah, a database would be much, much more typical. – Dean J Nov 02 '10 at 19:21
  • 4
    @josefk: I know this post is old but you can allocate more than 15G on RAM to a single JVM process. I have tried it till 25GB and it works. Specifications: 64 core machine with 64 GB RAM and Sun JDK 6. – Sanjay T. Sharma Feb 03 '11 at 16:06
  • @Sanjay T. Sharma good to know, I don't have access to a 64 bit system so I couldn't check it and wrongly assumed that the heapsize would be limited due to system or jvm limitations. – josefx Feb 03 '11 at 17:46
  • You don't need to store the raw text in memory. You only need to store the unique terms, which should be several orders of magnitude less data, plus a corresponding int for each. Depending on the input data, we are probably talking 100,000 terms, or at most 1 million, which is easy to store on a modern computer. – Jay Askren Feb 25 '16 at 20:00
6

A 1.7 GB file is a relatively small file to do this with and store in RAM. I do this with much larger files and store them in memory without a problem. A database could be used, but may be overkill or may be perfect depending on what you plan to do with the data.

As others have said, with natural language there will likely be a relatively small number of unique values, so the map will not actually get that large. I would not use a java.util.HashMap, as it is very inefficient in terms of memory usage, especially when storing primitive values such as ints: java.util.HashMap stores primitives as objects, and it also wraps each entry in a HashMap.Entry object, which wastes memory. Because of these two factors, java.util.HashMap uses much more memory than alternatives such as Trove, fastutil, and others.

As mentioned, there are several map implementations which do not have these problems. Since you are storing numbers in your map, an extra benefit is a performance boost: there is no need to constantly convert between objects and primitives (i.e. boxing/unboxing) as you put new values into the map or update old ones. A benchmark of various primitive hash maps that are better suited to large amounts of data can be found in a post on the Java Performance Tuning Guide.
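
A brief sketch using a primitive-valued map from fastutil (the library choice is an assumption; Trove and the other libraries mentioned above offer equivalent classes):

import it.unimi.dsi.fastutil.objects.Object2DoubleOpenHashMap;

// Frequencies are stored as unboxed doubles; no Double or Map.Entry objects per term.
public class TermFrequencies {
    private final Object2DoubleOpenHashMap<String> frequencies = new Object2DoubleOpenHashMap<>();

    public void add(String term, double amount) {
        frequencies.addTo(term, amount);        // accumulate without boxing
    }

    public double get(String term) {
        return frequencies.getDouble(term);     // 0.0 if the term is absent
    }
}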

Jay Askren
  • 10,282
  • 14
  • 53
  • 75
5

With 100 million terms you are almost certainly over the limit of what should be stored in-memory. Store your terms in a database of some kind. Either use a commercial database, or write something that allows you to access the file to get the information you want. If the file format you have doesn't let you quickly access the file then convert it to one that does - for example make each record a fixed size, so you can instantly calculate the file offset for any record number. Sorting the records will then allow you to do a binary search very quickly. You can also write code to hugely speed up access to the files without needing to store the whole file in memory.
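
A hedged sketch of the fixed-size-record idea: each record is assumed to be a term padded to a fixed width followed by an 8-byte double, with records sorted by term, so a lookup is a binary search over file offsets (the record layout and sizes are illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Fixed-size records on disk: TERM_BYTES bytes of padded term + 8-byte double frequency.
public class RecordFile implements AutoCloseable {
    private static final int TERM_BYTES = 32;               // assumed maximum term width
    private static final int RECORD_BYTES = TERM_BYTES + 8;

    private final RandomAccessFile file;
    private final long recordCount;

    public RecordFile(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        recordCount = file.length() / RECORD_BYTES;
    }

    // Binary search over the sorted records; only one record is in memory at a time.
    public Double lookup(String term) throws IOException {
        long lo = 0, hi = recordCount - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_BYTES);
            byte[] buf = new byte[TERM_BYTES];
            file.readFully(buf);
            int cmp = new String(buf, StandardCharsets.UTF_8).trim().compareTo(term);
            if (cmp == 0) {
                return file.readDouble();        // frequency follows the term bytes
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }

    @Override
    public void close() throws IOException { file.close(); }
}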

DJClayworth
  • 26,349
  • 9
  • 53
  • 79
5

If you just want a lightweight key-value (Map) store, I would look into using Redis. It is very fast and can persist the data if needed. The only downside is that you need to run the Redis store on a Linux machine.

If you are limited to Windows, MongoDB is a good option if you can run it in 64-bit.
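
A small sketch of the Redis route using the Jedis client (the client library, host, and hash key name are assumptions); the frequencies then live in Redis rather than on the JVM heap:

import redis.clients.jedis.Jedis;

// Stores term -> frequency in a Redis hash; the JVM keeps no per-term objects.
public class RedisTermStore implements AutoCloseable {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void put(String term, double frequency) {
        jedis.hset("term-frequencies", term, Double.toString(frequency));
    }

    public Double get(String term) {
        String value = jedis.hget("term-frequencies", term);
        return value == null ? null : Double.valueOf(value);
    }

    @Override
    public void close() { jedis.close(); }
}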

Joshua
  • 26,234
  • 22
  • 77
  • 106
2

You could also try stemming to increase the number of duplicates.

For instance, cat = Cats = cats = Cat

or

swim = swimming = swims

try Googling "Porter Stemmer"
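
A deliberately crude stand-in for a real Porter stemmer, just to illustrate how stemming collapses variants onto one key before counting (a real stemmer handles far more cases, e.g. the doubled consonant in "swimming"):

import java.util.Locale;

// Lowercases and strips a few common English suffixes so variants share one map key.
public final class CrudeStemmer {
    private static final String[] SUFFIXES = {"ing", "es", "s"};

    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);   // "Cat", "cats" -> "cat"
        for (String suffix : SUFFIXES) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}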

Ivan
  • 1,256
  • 1
  • 9
  • 16
1

Trove's THashMap uses a lot less memory. Still, I doubt that would be enough of a reduction on its own; you need somewhere other than memory to store this information for retrieval.
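
A two-method sketch with Trove's object-to-double map (the library and class choice are assumptions; other answers mention alternatives with similar APIs):

import gnu.trove.map.hash.TObjectDoubleHashMap;

// Values are stored as primitive doubles instead of boxed Double objects.
public class TroveTermFrequencies {
    private final TObjectDoubleHashMap<String> frequencies = new TObjectDoubleHashMap<>();

    public void put(String term, double frequency) {
        frequencies.put(term, frequency);
    }

    public double get(String term) {
        return frequencies.get(term);   // returns the no-entry value (0.0 by default) if absent
    }
}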

AHungerArtist
  • 9,332
  • 17
  • 73
  • 109
  • 1
    Please don't recommend Trove, there are better options: http://java-performance.info/hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove-january-2015/ – leventov Jul 07 '16 at 20:15
  • 1
    @leventov You realize this answer is six years old, right? At the time, it was a good option. It's good to have updated information, but just say that. – AHungerArtist Jul 10 '16 at 20:18
1

Other answers have already pointed out that the problem lies with memory usage. Depending on your problem domain, you could design a key class that reduces the overall memory footprint. For example, if your key consists of natural-language phrases, you could split the phrase and intern the words that make it up; e.g.

import java.util.Arrays;

public class Phrase {
  private final String[] interned;

  public Phrase(String phrase) {
    String[] tmp = phrase.split("\\s+");

    this.interned = new String[tmp.length];

    for (int i = 0; i < tmp.length; ++i) {
      // intern() makes repeated words share a single String instance across phrases
      this.interned[i] = tmp[i].intern();
    }
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Phrase && Arrays.equals(interned, ((Phrase) o).interned);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(interned);
  }
}

In fact, this solution might work even if the Strings do not represent natural language, provided there is significant overlap between Strings that can be exploited.

Adamski
  • 54,009
  • 15
  • 113
  • 152
1

Drop the HashMap and load all that data into HBase or another NoSQL datastore, and write your queries as MapReduce operations. This is the approach taken by Google Search and many other sites dealing with huge amounts of data, and it has proven to scale to practically unlimited size.
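
A minimal sketch of the MapReduce side, along the lines of Hadoop's classic word-count example (job configuration is omitted and the Hadoop dependency is assumed):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (term, 1) for every term in each input line.
class TermMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            term.set(tokens.nextToken());
            context.write(term, ONE);
        }
    }
}

// Reducer: sum the emitted counts for each term.
class TermReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}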

Barend
  • 17,296
  • 2
  • 61
  • 80
1

Consider replacing it with a cdb (constant database). It handles databases up to 4 GB, and:

A successful lookup in a large database normally takes just two disk accesses. An unsuccessful lookup takes only one.

whiskeysierra
  • 5,030
  • 1
  • 29
  • 40
0

Back of the envelope: 1.7 GB / 100 M ≈ 18 bytes per term and frequency.

We can use a hand-coded hash map backed by two parallel arrays; a sketch follows after this list.

  1. One int array holds the frequencies (values); the other is a single C-style char array simulating a two-dimensional array (an array of fixed-size char arrays), indexed by calculation. We cannot use a real Java two-dimensional array because it comes with too much per-object overhead. Each fixed-size slot of the char array holds one key, so we compute the hash of a key, place it in this "two-dimensional array", and resolve collisions with, say, linear probing. Key and value pairs are tied together by sharing the same index in the two arrays.

  2. The hash map has to use open addressing, since we do not have enough memory for chaining.

  3. We could have, say, 10 instances of this hash map bucketed by key length; I can't be certain, since I don't know the characteristics of the data.

  4. Space used ≈ 2^29 bytes for the int array + (2^4, i.e. 16 bytes per key) × 2^27 keys ≈ 2.5 GB.

  5. If we want double frequencies instead of ints, we may need to reduce the key size accordingly.
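
A minimal sketch of that layout (the key width, sizes, and hash function are illustrative; there is no resizing or deletion, the table must be sized larger than the entry count, and at the 100-million-key scale a real version would shard the key bytes across several arrays to stay under Java's array size limit):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Open addressing with linear probing; keys are fixed-width byte slices in one flat
// array, frequencies sit in a parallel int array at the same slot index.
public class FlatTermMap {
    private static final int KEY_BYTES = 16;        // fixed key width, as assumed above

    private final int capacity;                     // choose ~2x the expected entry count
    private final byte[] keys;                      // capacity * KEY_BYTES; all-zero slot = empty
    private final int[] frequencies;

    public FlatTermMap(int capacity) {
        this.capacity = capacity;
        this.keys = new byte[capacity * KEY_BYTES];
        this.frequencies = new int[capacity];
    }

    public void put(String term, int frequency) {
        byte[] key = encode(term);
        int slot = findSlot(key);
        System.arraycopy(key, 0, keys, slot * KEY_BYTES, KEY_BYTES);
        frequencies[slot] = frequency;
    }

    public int get(String term) {
        return frequencies[findSlot(encode(term))]; // 0 if the term was never stored
    }

    // Probe linearly from the hash slot until we hit the key or an empty slot.
    private int findSlot(byte[] key) {
        int slot = Math.floorMod(Arrays.hashCode(key), capacity);
        while (!isEmpty(slot) && !matches(slot, key)) {
            slot = (slot + 1) % capacity;
        }
        return slot;
    }

    private boolean isEmpty(int slot) {
        int off = slot * KEY_BYTES;
        for (int i = 0; i < KEY_BYTES; i++) {
            if (keys[off + i] != 0) return false;
        }
        return true;
    }

    private boolean matches(int slot, byte[] key) {
        int off = slot * KEY_BYTES;
        for (int i = 0; i < KEY_BYTES; i++) {
            if (keys[off + i] != key[i]) return false;
        }
        return true;
    }

    private static byte[] encode(String term) {
        byte[] raw = term.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(raw, KEY_BYTES);       // truncate or zero-pad to KEY_BYTES
    }
}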

Jesse
  • 8,605
  • 7
  • 47
  • 57
0

It's a bad design. Rather than keeping 1.7 GB of data in memory in a HashMap, I would do one of two things:

  1. Persist all the data (to a file or database) and keep only the top 1% or so in memory, with some policy for deciding which entries are in memory and when (a small sketch follows after this list).

  2. Use memcached. The easiest way out: an in-memory, distributed hash table. This is exactly what DHTs are used for.
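
A minimal sketch of option 1 using LinkedHashMap's access-order eviction to keep only a hot subset on the heap (loadFromBackingStore is a placeholder for whatever file or database persistence is chosen):

import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache: the least recently used terms are evicted once maxEntries is exceeded.
public class HotTermCache extends LinkedHashMap<String, Double> {
    private final int maxEntries;

    public HotTermCache(int maxEntries) {
        super(16, 0.75f, true);                 // true = access order, i.e. LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
        return size() > maxEntries;
    }

    public double frequency(String term) {
        Double cached = get(term);
        if (cached == null) {
            cached = loadFromBackingStore(term);
            put(term, cached);
        }
        return cached;
    }

    private double loadFromBackingStore(String term) {
        // Placeholder: read the frequency from the persisted file/database here.
        return 0.0;
    }
}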

zengr
  • 38,346
  • 37
  • 130
  • 192
0

There is an interesting offering from Terracotta - BigMemory - which seems to be exactly what you want. I haven't tried it myself and don't know the licensing terms, etc., though.

maximdim
  • 8,041
  • 3
  • 33
  • 48
0

In Java, an object has an overhead of at least 16 bytes before you consider what other content it holds.

1e8 items in a hash map have an underestimated size requirement of 1e8 * 2 * 16 bytes ≈ 3.2 GB, and that assumes your keys and values are Numbers, so you need a few GB of heap available on your computer.

A String is an object holding a character array, so your String keys, as mentioned by many above, may be larger than a Double object, for example; hence you would need even more memory available to the heap.

Note that programs also begin to perform poorly when you near the memory limit of your computer.

If you do not want to use a database as suggested above, you could consider encoding and compressing your keys to turn them into numbers whose frequencies you can still count. You could then choose an entropy-based encoding based upon the frequency of words in that first encoding and go from there... (a small sketch of the encoding idea follows below).
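
A small sketch of the "turn keys into numbers" idea: dictionary-encode each distinct term to a small int code once, then keep the frequencies in a plain int[] indexed by that code instead of boxed values in a HashMap (the class is illustrative; an entropy-based code as suggested above would compress further):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Each distinct term gets an int code; frequencies live in an unboxed int array.
public class DictionaryEncodedCounts {
    private final Map<String, Integer> termToCode = new HashMap<>();
    private int[] counts = new int[1024];

    public void count(String term) {
        int code = termToCode.computeIfAbsent(term, t -> termToCode.size());
        if (code >= counts.length) {
            counts = Arrays.copyOf(counts, counts.length * 2);   // grow as new codes appear
        }
        counts[code]++;
    }

    public int frequency(String term) {
        Integer code = termToCode.get(term);
        return code == null ? 0 : counts[code];
    }
}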

nichole
  • 121
  • 1
  • 6
-1

As for the reason it failed, I agree with the answers above.

A database is a good choice... but even at the commercial-DB level, they would suggest partitioning the data to work with it effectively.

Depending on your environment, I might suggest distributing your data across multiple nodes connected over a LAN, based on the key value:

Node 01 has keys starting with 'a', Node 02 has keys starting with 'b', ...

So your program suddenly turns into network programming...
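
A tiny sketch of routing a key to a node by its first letter, as described above (node addresses are made up; hashing the whole key would spread load more evenly than first-letter buckets):

// Picks the node responsible for a term: one node per starting letter, a fallback for the rest.
public class NodeRouter {
    private static final String FALLBACK_NODE = "node-other:9000";

    public String nodeFor(String term) {
        char first = Character.toLowerCase(term.charAt(0));
        if (first >= 'a' && first <= 'z') {
            return String.format("node-%02d:9000", first - 'a' + 1);   // 'a' -> node-01, 'b' -> node-02, ...
        }
        return FALLBACK_NODE;
    }
}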

exiter2000
  • 548
  • 5
  • 14
  • 1
    Oh, come on. 100 million rows is small change for a proper database server. There's no need to resort to partitioning here; I have tables with 10 times that amount of data without any performance problems. – TomTom Nov 02 '10 at 20:08