
I need to put about 20 million entries into a HashMap. I chose TLongObjectHashMap as per Why is Java HashMap slowing down?

The code looks like:

StringBuilder sb = new StringBuilder("");
StringBuilder value = new StringBuilder("");
TLongObjectHashMap<String> map = new TLongObjectHashMap<String>();

InputStream in = new FileInputStream(new File(inputFile));
BufferedReader br = new BufferedReader(new InputStreamReader(in), 102400);
for (String record; (record = br.readLine()) != null;) {
    sb.setLength(0);
    for (int i = 0; i < 2; i++) {
        for (int j = 1; j < 12; j++) {
            sb.append(record.charAt(j));
        }
    }

    for (int k = 2; k < 4; k++) {
        value.append(record.charAt(k));
    }
    for (int k = 7; k < 11; k++) {
        value.append(record.charAt(k));
    }

    map.put(Long.parseLong(sb.toString()), value.toString());
    value.delete(0, value.length());
}

I am using GNU Trove. Still, it becomes extremely slow and almost stops at about 15 million entries. There is no OutOfMemoryError yet. What is the problem?

I have no option to use a DB for this.

Note: values like 1, 12, 2, 4, etc. are calculated before this loop and stored in variables, which are then used here. I have just replaced them with literal values for this post.

Harbinger
  • what is the max heap size of the JVM where this is being run? – Vikdor Jan 06 '15 at 10:45
  • Have you tried turning on GC tracing to see how much time is being spent in GC? What is the purpose of your `value` variable (the `StringBuilder`)? – Jon Skeet Jan 06 '15 at 10:45
  • Twenty million entries _may_ be pushing the limits, have you thought about external storage, like a database? – paxdiablo Jan 06 '15 at 10:46
  • @JonSkeet "value" is the value for the map. – Harbinger Jan 06 '15 at 10:46
  • @paxdiablo No. I will not be allowed a DB. – Harbinger Jan 06 '15 at 10:47
  • Increasing the heap size won't help you since your application is leaking. Try: 1. declare your variables private; 2. profile your project with [Jprofiler](https://www.ej-technologies.com/products/jprofiler/overview.html) and see which object specifically is leaking. – Payam Jan 06 '15 at 10:49
  • Well it's not the value - the string is the value. Why are you not just doing `map.put(Long.parseLong(inLine.substring(10, 14)), inLine.substring(70, 84))`? You don't need those `StringBuilder` objects at all as far as I can see. – Jon Skeet Jan 06 '15 at 10:49
  • What is the size of the file you are reading? Reading a large file may also be slow. – pmverma Jan 06 '15 at 10:50
  • @pmverma It's around 450 MB – Harbinger Jan 06 '15 at 10:53
  • Sounds like Hazelcast will help you. 20 million records sounds like a lot, plus in a single JVM that's a single point of failure. What about an LRU hashmap with a smaller size? – vikingsteve Jan 06 '15 at 10:55
  • I would rather halve the file and check whether it is caused by reading the large file. – pmverma Jan 06 '15 at 10:57
  • @JonSkeet: there are a few lines of code in between; I shortened them. I have updated it now for you. This is why I used StringBuilder. – Harbinger Jan 06 '15 at 10:58
  • @Payam: this is inside a method and all are local variables. – Harbinger Jan 06 '15 at 10:58
  • That looks like a complicated way of doing `record.substring(1, 12) + record.substring(2, 4) + record.substring(7, 11)` to me - with an extra loop (with `i`) for no obvious reason. – Jon Skeet Jan 06 '15 at 11:00
  • @JonSkeet: values like 1, 12, 2, 4, etc. are calculated before this loop and stored in variables, which are then used here. I just replaced them with some values now. – Harbinger Jan 06 '15 at 11:02
  • @JonSkeet: used the charAt as per http://stackoverflow.com/questions/27522936/using-single-stringbuilder-throws-outofmemeoryexception/27523060#27523060 – Harbinger Jan 06 '15 at 11:03
  • That doesn't change whether using substring would be simpler... basically you've got a couple of very long-lived `StringBuilder` objects for no obvious reason. I question the advice that you received in the other question - the garbage collector is very good at collecting very short-lived objects. – Jon Skeet Jan 06 '15 at 11:03
  • @JonSkeet I set their length to 0 for each loop. Will a long-lived StringBuilder impact the speed? – Harbinger Jan 06 '15 at 11:05
  • Each time you do `value.append(inLine.substring(70,84))` your StringBuilder is getting bigger and bigger. Is this right? – pmverma Jan 06 '15 at 11:06
  • Yes, I know you clear it out, but even so, it's definitely not how I would approach things. I don't know whether it's affecting the speed, but I'd stick with simpler code personally. How much more have you allocated to your VM? You've got really quite a lot of data here... – Jon Skeet Jan 06 '15 at 11:07
  • @pmverma I set length to 0 at the beginning of each loop – Harbinger Jan 06 '15 at 11:08
  • @JonSkeet: It's only 268435456 bytes. But it would at least throw an OutOfMemoryError if memory were the issue here. Am I wrong? – Harbinger Jan 06 '15 at 11:09
  • No, it won't necessarily. As you generate more garbage, the GC will have to work harder to find empty memory, even if it's available. That's why my very first comment was suggesting that you turn on GC tracing to see how much of your time is spent in garbage collection. I haven't seen any indication yet that you've tried that. – Jon Skeet Jan 06 '15 at 11:10
  • @JonSkeet Just seen it now. I already set the argument -Xmx512m in the IDE. – Harbinger Jan 06 '15 at 11:12
  • If you are sure about the data size, you should also provide the capacity to the HashMap. That way the VM does not need to keep reallocating memory as you insert your 20 million entries. – pmverma Jan 06 '15 at 11:12
  • @pmverma I won't be able to guess that. The size may grow. – Harbinger Jan 06 '15 at 11:13
  • That's not GC tracing. That's just setting the VM's max heap to 512M, which I suspect won't be enough. Use `-Xloggc:gc.txt` to log to a file (see the example command line after these comments). – Jon Skeet Jan 06 '15 at 11:14
  • Of course, but you can guess: if you are pushing 20 million entries, you can set the initial capacity to around 20000000. This will be pre-allocated before the data is inserted, and will reduce the performance issue in my view. – pmverma Jan 06 '15 at 11:17
  • I don't know why you're talking about, and tagging as, `HashMap` when you're not using one. – user207421 Jan 06 '15 at 11:39
  • Instead of writing `value.append(inLine.substring(70,84))`, you can write `value.append(inLine, 70, 84)` and it will do something smarter than `substring`. See http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html#append(java.lang.CharSequence,%20int,%20int) . – Louis Wasserman Jan 06 '15 at 19:19
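
Jon Skeet's GC-tracing suggestion above amounts to a few JVM flags. A minimal sketch of the command line, assuming a 2015-era HotSpot JVM; `MainClass` and the heap size are placeholders, not taken from the question:

java -Xmx512m -verbose:gc -Xloggc:gc.txt -XX:+PrintGCDetails -XX:+PrintGCTimeStamps MainClass

If the resulting log shows collections taking longer and longer as the map grows, that points at heap pressure rather than at the map implementation itself.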

2 Answers


I am using GNU Trove. Still, it becomes extremely slow and almost stops at about 15 million entries. There is no OutOfMemoryError yet. What is the problem?

The problem is that you're making assumptions and not verifying them.

And you're not profiling your code. Your real code, not the half-edited snippet that you've posted here (by your own admission in the comments, the posted code has been shortened and had its values replaced, so it isn't the code that actually runs).

Yes, you're writing inefficient code. Those loops for copying characters, for example, duplicate String.substring(). You've already been told that, but it was buried in the mass of comments and you probably missed it. Another good comment was to use simple concatenation of those substrings rather than mucking around with StringBuilder.
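
As a sketch of what those comments suggest, using the literal offsets from the posted code (the real offsets are computed elsewhere, as the question notes):

// Sketch only: 2, 4, 7 and 11 are the placeholder offsets from the posted code.
String value = record.substring(2, 4) + record.substring(7, 11);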

But the real problem is assuming that your map is inefficient, based on something that you read on the internet, and have done nothing to challenge that assumption. I can guarantee that the time taken to read records from disk is far greater than the time to insert one value in the map for each record.

What you need to do is to prove that to yourself. Profiling your code is the best way to do this, but you can also separate out the pieces of the program. Use a simple loop like the following to get a sense of how fast your map really is (I used HashMap because I don't have the Trove library installed; it took approximately 2 minutes to fill the map with 100,000,000 entries). I'll leave it to you to write a similar test for reading data from your file.

private static Map<Long,String> fillMap(int items)
{
    Map<Long,String> map = new HashMap<Long,String>(items);
    Random rnd = new Random();

    long start = System.currentTimeMillis();

    for (int ii = 0 ; ii < items ; ii++)
    {
        map.put(new Long(rnd.nextLong()), new String("123456789012345678901234567890"));
    }

    long finish = System.currentTimeMillis();
    double elapsed = ((finish - start) / 1000.0);
    System.out.format("time to produce %d items: %8.3f seconds (map size = %d)\n", items, elapsed, map.size());
    return map;
}
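
For the file-reading half, a minimal sketch of the kind of timing test the paragraph above leaves to you (the buffer size is the one from the question; `inputFile` and the method name are placeholders):

private static void timeFileRead(String inputFile) throws IOException
{
    long start = System.currentTimeMillis();
    int lines = 0;

    // read the whole file line by line, doing nothing with the data
    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(inputFile)), 102400);
    try
    {
        while (br.readLine() != null)
        {
            lines++;
        }
    }
    finally
    {
        br.close();
    }

    double elapsed = (System.currentTimeMillis() - start) / 1000.0;
    System.out.format("time to read %d lines: %8.3f seconds\n", lines, elapsed);
}

If reading alone already accounts for most of the total run time, the map is not your problem.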
guest

I do not believe the JDK's built-in HashMap cannot deal with this. There are two problems I see:

  • Constant rehashing of the map as it grows
  • StringBuilder objects that are not necessary

Rehashing takes place when the load factor of the underlying storage array reaches 75%:

DEFAULT_INITIAL_CAPACITY = 16;  
DEFAULT_LOAD_FACTOR = 0.75;  
THRESHOLD = DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR;
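
To see what that means at this scale, a rough sketch that only counts the doublings a default-sized HashMap goes through on the way to 20 million entries (every doubling rehashes all entries inserted so far):

// Back-of-envelope: count resizes of a default-sized HashMap growing to 20 million entries.
int cap = 16;        // DEFAULT_INITIAL_CAPACITY
int resizes = 0;
while (cap * 0.75 < 20_000_000) {   // threshold = capacity * DEFAULT_LOAD_FACTOR
    cap *= 2;
    resizes++;
}
System.out.println(resizes + " resizes, final capacity " + cap);
// prints: 21 resizes, final capacity 33554432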

I assume the following is exponentially less work and does the same thing:

double expected_maximal_number_of_data = 30000000d;
int capacity = (int) (expected_maximal_number_of_data / 0.75 + 1);
HashMap<Long, String> map = new HashMap<Long, String>(capacity);
for (String record; (record = br.readLine()) != null;) {
    Long key = Long.parseLong(record.substring(1, 12));
    String value = record.substring(2, 4) + record.substring(7, 11);
    map.put(key, value);
}

If your computer has 2 GB of RAM you should not have a problem; estimated completion time is under 16 seconds.
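
The same pre-sizing idea should carry over to the Trove map from the question; a sketch, assuming GNU Trove's `TLongObjectHashMap(int initialCapacity, float loadFactor)` constructor:

// Pre-size the Trove map the same way, so it never needs to rehash
// while the 20 million entries are inserted.
TLongObjectHashMap<String> troveMap = new TLongObjectHashMap<String>(capacity, 0.75f);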

Margus
  • It is *more* work actually. `put()` already checks whether the item is present or not. – user207421 Jan 06 '15 at 11:41
  • @EJP That is not the point; you should use **map.key()** instead anyway. I just tried to give a subtle hint that he should mention what to do with duplicates, or why he is using a hash map in the first place. – Margus Jan 06 '15 at 11:48
  • @Margus There are no duplicates in the inputs for sure. They are already removed. – Harbinger Jan 06 '15 at 11:50
  • @Margus Creating a Long and a String on each loop iteration? – Harbinger Jan 06 '15 at 11:52
  • @Harbinger Yes, I can do that. You can declare them beforehand as well, but it literally makes no amortized time difference. Note also that expected_maximal_number_of_data should be bigger than your assumption; in order to avoid even one rehash, set it to e.g. 30m. – Margus Jan 06 '15 at 12:04
  • In any case, if you find this approach slow, you should use a database like Java DB, which is Oracle's supported version of Apache Derby: http://docs.oracle.com/javase/tutorial/jdbc/ – Margus Jan 06 '15 at 12:20
  • So what *is* the point exactly? Why are you calling `contains()` redundantly? And why are you claiming this code is exponentially quicker? And what exactly is `map.key()`? – user207421 Jan 07 '15 at 00:49