
I have a file consisting of 7.6M lines. Each line is of the form A,B,C,D, where B, C and D are values used to calculate a level of importance for A, a String identifier that is unique for each line. My approach:

private void read(String filename) throws Throwable {
    BufferedReader br = new BufferedReader(new FileReader(filename));

    Map<String, Double> mmap = new HashMap<>(10000000, 0.8f);
    String line;
    long t0 = System.currentTimeMillis();
    while ((line = br.readLine()) != null) {
        split(line);
        mmap.put(splitted[0], 0.0);
    }
    long t1 = System.currentTimeMillis();
    br.close();
    System.out.println("Completed in " + (t1 - t0) / 1000.0 + " seconds");
}

private void split(String line) {
    int idxComma, idxToken = 0, fromIndex = 0;
    while ((idxComma = line.indexOf(delimiter, fromIndex)) != -1) {
        splitted[idxToken++] = line.substring(fromIndex, idxComma);
        fromIndex = idxComma + 1;
    }
    splitted[idxToken] = line.substring(fromIndex);
}

where the dummy value 0.0 is inserted for "profiling" purposes and splitted is a simple String array defined for the class. I initially worked with String's split() method, but found the above to be faster.
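(For reference, the `splitted` and `delimiter` fields are not shown in the question; given the A,B,C,D format, plausible declarations would look like this — the array length and the char type are assumptions:)

private String[] splitted = new String[4]; // one slot per token; length assumed from the A,B,C,D format
private char delimiter = ',';              // a char works with line.indexOf(delimiter, fromIndex)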

When I run the above code, it takes 12 seconds to parse the file, which is way more than I think it should take. If I, e.g., replace the HashMap with a Vector of Strings and just take the first entry from each line (i.e. I do not store an associated value with it, as appending should be amortized constant time), the entire file can be read in less than 3 seconds.
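For reference, a minimal sketch of that Vector baseline, assuming the same split() helper and splitted field as above (readVector is a hypothetical name):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Vector;

// Same read loop, but appending identifiers to a Vector instead of a HashMap.
private void readVector(String filename) throws Throwable {
    BufferedReader br = new BufferedReader(new FileReader(filename));
    Vector<String> ids = new Vector<>(10000000);
    String line;
    while ((line = br.readLine()) != null) {
        split(line);
        ids.add(splitted[0]); // amortized O(1) append: no hashing, no boxing
    }
    br.close();
}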

This suggests to me that either (i) there are a lot of collisions in the HashMap (I have tried to minimise the number of resizes by preallocating the size and setting the load factor accordingly), or (ii) the hashCode() function is somehow slow. I doubt it's (ii), because if I use a HashSet the file can be read in under 4 seconds.

My question is: what could be the reason that the HashMap performs so slowly? Is the hashCode() insufficient for maps of this size, or is there something fundamental that I have overlooked?

user1938803
  • Try replacing your `0.0` dummy value with some static final constant. `0.0` is replaced by `Double.valueOf`, which creates a new object every time, while in the `HashSet` only one preallocated dummy object is used. I'm not sure that's the reason, but it could be. – esin88 Mar 01 '17 at 11:20
  • The final element of `splitted[]` will always hold the entire line. This is not what you want. – user207421 Mar 01 '17 at 11:30
  • `HashSet` is backed by `HashMap` internally, so the only difference is the auto-boxing of your dummy `0.0`. – bashnesnos Mar 01 '17 at 11:41
  • Could you separate I/O and tokenizing from the map insertion, just to be sure you're not focusing on the wrong thing? [micro benchmarking](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – Laurent G Mar 01 '17 at 12:03
  • `String.split()` is slower because it allocates a new regex `Pattern` on every call. Try creating a `private static final Pattern SPLITTER = Pattern.compile(",");` then `SPLITTER.split(line)`. – AngerClown Mar 01 '17 at 12:23
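For illustration, a minimal sketch of that precompiled-Pattern suggestion (the field name SPLITTER comes from the comment; splitWithPattern is a hypothetical helper):

import java.util.regex.Pattern;

// Compile the regex once, instead of once per String.split(",") call.
private static final Pattern SPLITTER = Pattern.compile(",");

private String[] splitWithPattern(String line) {
    return SPLITTER.split(line); // reuses the precompiled pattern
}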

3 Answers


HashMap vs Vector: inserting into a HashMap is considerably more costly than inserting into a Vector. Although both are amortized constant-time operations, the HashMap performs a number of other operations internally (computing the hash code, checking for collisions, resolving collisions, etc.), whereas the Vector just inserts the element at the end (growing the backing array if required).

HashMap vs HashSet: HashSet uses a HashMap internally, so there shouldn't be any performance difference if you use them for the same purpose. That said, the two serve different purposes, so discussing which is better is moot.
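For reference, this is roughly how HashSet delegates to its internal HashMap (paraphrased from the OpenJDK sources) — note the single shared PRESENT dummy value:

// Inside java.util.HashSet (paraphrased from OpenJDK):
private transient HashMap<E,Object> map;
private static final Object PRESENT = new Object(); // one shared dummy for all keys

public boolean add(E e) {
    return map.put(e, PRESENT) == null; // no per-add allocation of the value
}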

Since you need B,C,D as the value for A as the key, you should definitely stick with HashMap. If you really want to just compare the performance, put null instead of 0.0 as the value for all the keys (because that is what HashSet uses when putting keys into its backing HashMap).

Update: HashSet actually uses a dummy constant value (static final), not null, when inserting into its HashMap. Sorry about that. You can replace your 0.0 with any constant and the performance should be similar to HashSet's.
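A minimal sketch of that fix applied to the question's code (DUMMY is a hypothetical name):

// One shared boxed Double instead of autoboxing a fresh Double.valueOf(0.0) per line.
private static final Double DUMMY = 0.0;

// ...then in the read loop:
mmap.put(splitted[0], DUMMY); // reuses the same object every time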

iavanish

You could use a more memory-efficient Collections library.

I suggest Eclipse Collections ( https://www.eclipse.org/collections/ ), which has an ObjectDoubleMap ( https://www.eclipse.org/collections/javadoc/8.0.0/org/eclipse/collections/api/map/primitive/ObjectDoubleMap.html ): a map of objects (String in your case) to primitive double values. It is much better at handling memory, and faster too.

You can get an empty instance of this by doing:

ObjectDoubleMaps.mutable.empty();
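A short usage sketch (types and methods as in the Eclipse Collections 8.x API linked above; the key string is a placeholder):

import org.eclipse.collections.api.map.primitive.MutableObjectDoubleMap;
import org.eclipse.collections.impl.factory.primitive.ObjectDoubleMaps;

// Map from String keys to primitive doubles: no Double boxing on insert.
MutableObjectDoubleMap<String> mmap = ObjectDoubleMaps.mutable.empty();
mmap.put("someId", 0.0); // stores the double directly, no wrapper object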
Boschi

Yep, I checked your example with 0.0 as the dummy value vs. a static final constant as the dummy value vs. HashSet. That's a rough comparison; for better precision I would recommend the JMH tool, but my HashSet performance was pretty much the same as with the static constant dummy.

So, most probably, the low performance is caused by boxing your 0.0 dummy value for every line (it's replaced by Double.valueOf() at compile time, which creates a new Double object every time).

That would explain the low performance, since HashSet uses a predefined static final dummy object (which is not null, by the way).
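A quick way to observe this (on HotSpot/OpenJDK, where Double.valueOf() does not cache boxed values the way Integer.valueOf() caches small ints):

Double a = 0.0; // autoboxing compiles to Double.valueOf(0.0)
Double b = 0.0;
System.out.println(a == b); // prints false here: two distinct Double objects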

esin88