I have around 1,000,000 database records, each a paragraph of roughly 500 characters. After reading all the records, I need to produce the list of letters of the alphabet ordered from most to least used.
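To make the goal concrete, here is a minimal sketch of the final ordering step, assuming the per-letter counts have already been collected into a `Map<Character, Long>`. The class name `LetterRanking` and the method `rankByCount` are hypothetical names used only for illustration, and breaking ties alphabetically is my own assumption.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class LetterRanking {
    // Hypothetical helper: returns the letters sorted from most to least used.
    // Ties are broken alphabetically (an assumption, not a requirement).
    static List<Character> rankByCount(Map<Character, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Character, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<Character, Long>comparingByKey()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Character, Long> counts = new HashMap<>();
        counts.put('a', 3L);
        counts.put('b', 7L);
        counts.put('c', 3L);
        System.out.println(rankByCount(counts)); // prints [b, a, c]
    }
}
```

This step is cheap (at most 26 entries), so it is not part of the timing below; the question is about the counting stage.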
I mock the database read by creating a stream of 1,000,000 elements and then processing that stream in parallel:
    final Map<Character, Long> charCountMap = new ConcurrentHashMap<>();
    for (char c = 'a'; c <= 'z'; c++) {
        charCountMap.put(c, 0L);
    }

    System.out.println("Parallel Stream");
    long start = System.currentTimeMillis();
    Stream.iterate(0, i -> i).limit(1000000).parallel() // mock database stream
        .forEach(i -> RandomStringUtils.randomAlphanumeric(500)
            .toLowerCase().chars().mapToObj(c -> Character.valueOf((char) c))
            .filter(c -> c >= 'a' && c <= 'z')
            .forEach(c -> charCountMap.compute(c, (k, v) -> v + 1))); // update ConcurrentHashMap
    long end = System.currentTimeMillis();
    System.out.println("Parallel Stream time spent :" + (end - start));

    System.out.println("Serial Stream");
    start = System.currentTimeMillis();
    Stream.iterate(0, i -> i).limit(1000000) // mock database stream
        .forEach(i -> RandomStringUtils.randomAlphanumeric(500)
            .toLowerCase().chars().mapToObj(c -> Character.valueOf((char) c))
            .filter(c -> c >= 'a' && c <= 'z')
            .forEach(c -> charCountMap.compute(c, (k, v) -> v + 1)));
    end = System.currentTimeMillis();
    System.out.println("Serial Stream time spent :" + (end - start));
I initially thought that the parallel stream would be faster, even with the expected overhead, for any stream larger than 100,000 elements. However, the test shows that the serial stream is ~5x faster than the parallel one, even for 1,000,000 records.
I suspected the cause was the updates to the ConcurrentHashMap. But when I removed the update and replaced it with an empty function, there was still a significant performance gap.
Is there something wrong with my mock database call, or with the way I use the parallel stream?