
Problem

There is a huge file (10 GB); one has to read the file and print out the number of words that repeat exactly k times in the file.

My Solution

  1. Use ifstream to read the file word by word;
  2. Insert each word into a map: std::map<std::string, long> mp; mp[word] += 1;
  3. Once the file is read, scan the map to find the words occurring k times (sketched below).
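
For reference, a minimal single-threaded sketch of the approach above; the file name and the value of k are placeholders:

    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main()
    {
        std::ifstream in("input.txt");   // placeholder file name
        const long k = 3;                // placeholder value of k

        std::map<std::string, long> mp;
        std::string word;
        while (in >> word)               // step 1: read word by word
            mp[word] += 1;               // step 2: count each word

        long repeated = 0;
        for (const auto& entry : mp)     // step 3: words seen exactly k times
            if (entry.second == k)
                ++repeated;

        std::cout << repeated << '\n';
    }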

Question

  1. How can multithreading be used to read the file efficiently (read by chunk)? Or is there any other method to improve the read speed?
  2. Is there a better data structure than map that can be employed to produce the output efficiently?

File info

  1. each line can be maximum of 500 words length
  2. each word can be maximum of 100 char length
Praveen

2 Answers


How can multithreading be used to read the file efficiently (read by chunk)? Or is there any other method to improve the read speed?

I've tried out the actual results, and multithreading does pay off, unlike my previous advice here. The un-threaded variant runs in 1m44.711s, the 4-thread one (on 4 cores) runs in 0m31.559s, and the 8-thread one (on 4 cores + HT) runs in 0m23.435s. A major improvement, then: almost a factor of 5 in speedup.

So, how do you split up the workload? Split it into N chunks (N == thread count) and have each thread except the first seek forward to the first non-word character; that is the start of its logical chunk. Its logical chunk ends at its end boundary, rounded up to the first non-word character after that point.

Process these blocks in parallel, sync them all to one thread, and then have that thread merge the results.
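
A sketch of that scheme, assuming the file contents are already available as one contiguous in-memory buffer (in practice via the memory map discussed below); the function names and the whitespace-based word definition are illustrative, not part of the original answer:

    #include <cctype>
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    // std::string keys copy each word; string_view (see below) avoids that.
    using Counts = std::unordered_map<std::string, std::size_t>;

    static bool is_word_char(char c) { return !std::isspace((unsigned char)c); }

    static void count_words(const char* begin, const char* end, Counts& out)
    {
        const char* p = begin;
        while (p != end) {
            while (p != end && !is_word_char(*p)) ++p;   // skip separators
            const char* start = p;
            while (p != end && is_word_char(*p)) ++p;    // scan one word
            if (start != p) ++out[std::string(start, p)];
        }
    }

    Counts count_parallel(const char* data, std::size_t size, unsigned n)
    {
        std::vector<std::thread> threads;
        std::vector<Counts> partial(n);
        std::size_t chunk = size / n;

        std::size_t begin = 0;
        for (unsigned i = 0; i < n; ++i) {
            // Tentative end, then rounded up to the next non-word character,
            // so the word straddling the boundary stays in this chunk.
            // Boundaries only ever move forward, so every byte lands in
            // exactly one chunk and no word is split.
            std::size_t end = (i + 1 == n) ? size : (i + 1) * chunk;
            while (end < size && is_word_char(data[end])) ++end;
            threads.emplace_back(count_words, data + begin, data + end,
                                 std::ref(partial[i]));
            begin = end;
        }
        for (auto& t : threads) t.join();

        Counts total;                       // merge on one thread
        for (auto& m : partial)
            for (auto& kv : m)
                total[kv.first] += kv.second;
        return total;
    }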

The next best thing you can do to improve read speed is to ensure you don't copy data where possible. Read through a memory-mapped file, and identify strings by keeping pointers or indices to their start and end instead of accumulating bytes.
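
For example, a minimal sketch of the mapping step on a POSIX system (map_file is an illustrative name; error handling for open/fstat/mmap failures is omitted for brevity):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <string_view>

    // Map the whole file read-only and expose it as a string_view,
    // so the rest of the program never copies the file's bytes.
    std::string_view map_file(const char* path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after closing the descriptor
        return { static_cast<const char*>(data),
                 static_cast<std::size_t>(st.st_size) };
    }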

Is there a better data structure than map that can be employed to produce the output efficiently?

Well, since I don't think you'll be using the ordering, unordered_map is a better choice. I would also make it an unordered_map<std::string_view, size_t> - a string_view key copies even less data than a string would.
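
A sketch of that counting step over a memory-mapped buffer, assuming C++17 (the function name is illustrative); because the keys are views into the mapping, no word data is ever copied:

    #include <cctype>
    #include <cstddef>
    #include <string_view>
    #include <unordered_map>
    #include <vector>

    std::vector<std::string_view>
    words_repeated_k_times(std::string_view text, std::size_t k)
    {
        std::unordered_map<std::string_view, std::size_t> counts;
        std::size_t i = 0;
        while (i < text.size()) {
            // skip separators, then scan one word
            while (i < text.size() && std::isspace((unsigned char)text[i])) ++i;
            std::size_t start = i;
            while (i < text.size() && !std::isspace((unsigned char)text[i])) ++i;
            if (i > start)
                ++counts[text.substr(start, i - start)];  // view, not a copy
        }

        std::vector<std::string_view> result;
        for (const auto& kv : counts)
            if (kv.second == k)
                result.push_back(kv.first);
        return result;
    }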

On profiling, I find that 53% of the time is spent in finding the exact bucket that holds a given word.

dascandy

If you have a 64-bit system then you can memory-map the file, and use e.g. this solution to read from memory.

Combine that with the answer from dascandy regarding std::unordered_map and std::string_view (if you have it), and you should be about as fast as you can get in a single thread. You could also use std::unordered_multiset instead of std::unordered_map; which one is "faster" I don't know.
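
For completeness, a sketch of the multiset variant, assuming C++17 (the function name is illustrative). Equivalent keys are adjacent in an unordered_multiset's iteration order, which lets you walk the distinct words group by group; count() is linear in the group size, which is one reason this may not be faster:

    #include <cstddef>
    #include <iterator>
    #include <string_view>
    #include <unordered_set>

    // Count how many distinct words occur exactly k times, given that every
    // word has already been inserted into the multiset.
    std::size_t distinct_with_count_k(
        const std::unordered_multiset<std::string_view>& words, std::size_t k)
    {
        std::size_t result = 0;
        for (auto it = words.begin(); it != words.end(); ) {
            std::size_t n = words.count(*it);  // size of this word's group
            if (n == k)
                ++result;
            std::advance(it, n);               // jump to the next distinct word
        }
        return result;
    }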

Using threads is simple: just do what you do now, but have each thread handle only part of the file, and merge the maps after all threads are done. Note, however, that when you split the file into chunks for each thread, you risk splitting words in the middle, and handling this is not trivial.

Some programmer dude
  • Right. There are systems that don't have 10 GB of address space. `unordered_multiset` won't be faster, as it has to keep track of more data. If you do want to go multithreaded, split the file into N chunks (N == thread count) and have each thread but the first seek to the first non-word character first. Then make all of them handle words including those starting at their end boundary (except for the last block). When they all finish, merge and sync them. Note that processing 10 GB will take on the order of 1 second (assuming it's loaded) and threading will add about 2 seconds of overhead. – dascandy Jun 20 '17 at 06:53