I have a file consisting of 7.6M lines. Each line is of the form A,B,C,D, where B, C, and D are values used to calculate a level of importance for A, which is a String identifier that is unique to each line. My approach:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Class fields referenced below (splitted is reused for every line):
private final char delimiter = ',';
private final String[] splitted = new String[4]; // one slot each for A, B, C, D

private void read(String filename) throws Throwable {
    BufferedReader br = new BufferedReader(new FileReader(filename));
    Map<String, Double> mmap = new HashMap<>(10000000, 0.8f);
    String line;
    long t0 = System.currentTimeMillis();
    while ((line = br.readLine()) != null) {
        split(line);
        mmap.put(splitted[0], 0.0); // dummy value, see below
    }
    long t1 = System.currentTimeMillis();
    br.close();
    System.out.println("Completed in " + (t1 - t0) / 1000.0 + " seconds");
}

// Manual split on the delimiter into the preallocated splitted array.
private void split(String line) {
    int idxComma, idxToken = 0, fromIndex = 0;
    while ((idxComma = line.indexOf(delimiter, fromIndex)) != -1) {
        splitted[idxToken++] = line.substring(fromIndex, idxComma);
        fromIndex = idxComma + 1;
    }
    splitted[idxToken] = line.substring(fromIndex);
}
where the dummy value 0.0 is inserted for "profiling" purposes and splitted is a simple String array defined for the class. I initially worked with String's split() method, but found the above to be faster.
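For comparison, the split()-based loop body was essentially the following (a sketch from memory, not the exact code I timed):

    // Inside the same read loop, using String.split() instead of the manual split:
    String[] parts = line.split(",");  // allocates a fresh array for every line
    mmap.put(parts[0], 0.0);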
When I run the above code, it takes 12 seconds to parse the file, which is far more than I think it should take. If I, for example, replace the HashMap with a Vector of strings and just take the first entry from each line (i.e. I do not store an associated value with it, as appending should be amortized constant time), the entire file can be read in less than 3 seconds.
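The Vector-based comparison looked roughly like this (again a sketch rather than the exact benchmark code):

    // Same read loop, but collecting only the identifiers into a Vector:
    Vector<String> ids = new Vector<>();
    while ((line = br.readLine()) != null) {
        split(line);
        ids.add(splitted[0]); // amortized constant-time append, no hashing involved
    }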
This suggests to me that either (i) there are a lot of collisions in the HashMap (I have tried to minimise the number of resizes by preallocating the capacity and setting the load factor accordingly) or (ii) the hashCode() function is somehow slow. I doubt it is (ii), because if I use a HashSet the file can be read in under 4 seconds.
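To test (i), one diagnostic I could run is to count how many keys map to an already-occupied bucket, using the same index computation as OpenJDK 8's HashMap (hashCode() spread with h ^ (h >>> 16), then masked by the table size, which for an initial capacity of 10,000,000 is the next power of two, 1 << 24). This is only a sketch, assuming a Java 8+ HashMap and the ids list from the Vector variant above:

    // Diagnostic sketch (not run): estimate how clustered my keys are in the bucket array.
    int tableSize = 1 << 24;                 // tableSizeFor(10_000_000) in Java 8 HashMap
    int[] bucketCounts = new int[tableSize];
    int occupiedCollisions = 0;
    for (String key : ids) {                 // 'ids' = identifiers loaded via the Vector variant
        int h = key.hashCode();
        int idx = (tableSize - 1) & (h ^ (h >>> 16));
        if (bucketCounts[idx]++ > 0) {
            occupiedCollisions++;
        }
    }
    System.out.println("Keys hashing to an already-occupied bucket: " + occupiedCollisions);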
My question is: What could be the reason that the HashMap performs so slowly? Is String's hashCode() insufficient for maps of this size, or is there something fundamental that I have overlooked?