
Edit: the number of elements to be displayed is user defined; it defaults to 10 but can be set to a very big number.

I have a file which I parse for words; I then need to count how many times each word appears in the text and display the 10 words with the highest number of appearances (using C++).

I currently insert each parsed word into a std::map; the word is the key and the number of its appearances is the value. Each time I come across a word that is not yet in the std::map I add it with an initial value of 1, and each time I come across a word that is already in the map I add 1 to its current value.
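A minimal sketch of this counting step (assuming simple whitespace tokenization; the file name is just a placeholder):

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::ifstream in("input.txt");       // placeholder file name
    std::map<std::string, int> counts;   // word -> number of appearances
    std::string word;
    while (in >> word) {                 // naive whitespace tokenization
        ++counts[word];                  // new words start at 0 and are bumped to 1
    }
    std::cout << counts.size() << " unique words\n";
}
```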

After I am done parsing the file I have a map with all the unique words in the text and the number of their appearances, but the map is sorted by key, not by the count values.

At this point I can traverse the std::map and push its words into a priority queue (ordered with the minimum value at the top). Once the priority queue reaches its maximum capacity of 10 words, I check whether the value I am about to insert is bigger than the value at the top; if so, I pop the top and insert the new value, and if not, I move on to the next value from the std::map.
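A sketch of this selection step, taking the counts map built above (the helper name and the parameter k are just for illustration):

```cpp
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Returns the k most frequent (count, word) pairs.
std::vector<std::pair<int, std::string>>
top_k(const std::map<std::string, int>& counts, std::size_t k) {
    // min-heap ordered by count, so the smallest kept entry is always at the top
    std::priority_queue<std::pair<int, std::string>,
                        std::vector<std::pair<int, std::string>>,
                        std::greater<>> heap;
    for (const auto& [word, count] : counts) {
        if (heap.size() < k) {
            heap.emplace(count, word);
        } else if (count > heap.top().first) {
            heap.pop();                  // drop the current smallest
            heap.emplace(count, word);
        }
    }
    std::vector<std::pair<int, std::string>> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;                       // ascending by count; reverse if needed
}
```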

Because each word appears only once at this stage, I know for sure that each word in the priority queue is unique.

My question is: can this be done more efficiently in terms of complexity?

p3t3

1 Answer


This is Python's collections.Counter, so you could look there for a real-world example. It essentially does the same thing you are doing: get counts by incrementing a dictionary, then call heapq.nlargest on the (word, count) pairs. (A priority queue is a heap; I have no idea why they added a Q.)

Consider selecting the m largest/smallest out of N words. This should have a theoretical limit of O(N log m).

You should create the counts in O(N) time with a std::unordered_map. This is important: you don't care about sorting the words alphabetically, so don't use std::map here. If you use std::map, you're already at O(N log N), which is greater than the theoretical limit.
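A sketch of the counting phase with std::unordered_map (the function name is just illustrative; the increment idiom is the same as before, only the container changes):

```cpp
#include <istream>
#include <string>
#include <unordered_map>

// Counts whitespace-separated words in O(N) total time on average.
std::unordered_map<std::string, int> count_words(std::istream& in) {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        ++counts[word];   // same idiom as with std::map, but O(1) average per increment
    }
    return counts;
}
```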

Now, when selecting the top 10, you need pretty much any method that only looks at 10 items at a time. A priority queue with a max size is a good option. The important point is that you don't track more than you need to. Your complexity here is O(N log m), which becomes O(N) in the special case when m is small compared to N. The common mistake would be to include the whole data set when comparing items.

However, check whether m >= N, because if you do need the whole data set you can just call std::sort. I'm assuming you need them in order; if you didn't, this case would become trivial. Also check m == 1, so you can just use max.
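For the m >= N case, a sketch of just sorting everything by count (again, the helper name is illustrative):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// When m >= N, order everything: O(N log N).
std::vector<std::pair<std::string, int>>
sort_all_by_count(const std::unordered_map<std::string, int>& counts) {
    std::vector<std::pair<std::string, int>> all(counts.begin(), counts.end());
    std::sort(all.begin(), all.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return all;  // highest counts first
}
```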

In conclusion, except for using the wrong map, I believe you've already met the theoretical limit in terms of big O complexity.

Kenny Ostrom
  • I made a mistake that I just edited: the number of elements to be displayed can be user defined; it defaults to 10 but can be set to a very big number. – p3t3 Sep 25 '21 at 16:39
  • Okay, it's fundamentally N log N, but changing the original counter to unordered_map still benefits you, and keeping track of as few items as possible in the second phase is still wise. – Kenny Ostrom Sep 25 '21 at 16:49
  • I do agree about using unordered_map. Thanks! – p3t3 Sep 25 '21 at 17:14
  • You could also heapify the full list in O(N), then pop the m largest/smallest, but this is the same time complexity, using more space (see the sketch after these comments). – Kenny Ostrom Sep 25 '21 at 17:21
  • Here's a good reference: https://stackoverflow.com/questions/23038756/what-is-the-time-complexity-of-heapq-nlargest – Kenny Ostrom Sep 26 '21 at 12:30
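A sketch of the heapify-then-pop variant mentioned in the comments, using std::make_heap / std::pop_heap over all (count, word) pairs (helper name illustrative):

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Build a max-heap over all N entries in O(N), then pop the m largest in O(m log N).
std::vector<std::pair<int, std::string>>
top_m_via_heap(const std::unordered_map<std::string, int>& counts, std::size_t m) {
    std::vector<std::pair<int, std::string>> heap;
    heap.reserve(counts.size());
    for (const auto& [word, count] : counts) heap.emplace_back(count, word);
    std::make_heap(heap.begin(), heap.end());      // max-heap by count (pair's operator<)
    std::vector<std::pair<int, std::string>> result;
    while (!heap.empty() && result.size() < m) {
        std::pop_heap(heap.begin(), heap.end());   // move the current largest to the back
        result.push_back(heap.back());
        heap.pop_back();
    }
    return result;  // descending by count
}
```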