In general this class of problems is the topic of "Top K" or "selection" algorithms. Here's a Wikipedia article on the general topic: Wikipedia: Selection algorithm. The topic seems to have come into vogue with "Big Data" systems generally, and perhaps as a way to get past a previous generation of interview questions which focused on sorting algorithms long after every serious candidate had memorized quicksort and heapsort code.
As a practical matter this is just about the textbook problem for which "Big Data" systems (Hadoop and other Map/Reduce frameworks) were built. If the data is distributed over N nodes then each node can compute a partial histogram (mapping a histogram function over its own portion of the data set) and the results can be merged (reducing the per-node subtotals into grand totals).
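As a rough illustration of that map/merge shape, here is a minimal sketch (not actual Hadoop code; the shard contents and function names are invented for the example):

```python
from collections import Counter

def partial_histogram(lines):
    """'Map' step: one node counts the words in its own shard of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge_histograms(partials):
    """'Reduce' step: fold the per-node subtotals into grand totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two hypothetical shards held on two different nodes:
shard_a = ["the quick brown fox", "the lazy dog"]
shard_b = ["the dog barks at the fox"]
grand_total = merge_histograms([partial_histogram(shard_a),
                                partial_histogram(shard_b)])
print(grand_total.most_common(3))   # e.g. [('the', 4), ('fox', 2), ('dog', 2)]
```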
For an interview scenario this is a popular question because there is no simple trick. You could enumerate several published approaches from the academic literature, or you could tackle the problem de novo.
If the "vocabulary" is relatively small (for example there are only a few tens of thousands of words in a typical English lexicon --- so a quarter million word vocabulary is pretty extensive). In that case we expect that the count could fit in RAM for typical modern hardware. If the "words" in this data set are more extensive --- more than tens or hundreds of millions --- then such an approach is no longer feasible.
Conceivably one could try an adaptive or statistical approach. If we know that there are no major clusters of any single "word" --- that any statistically significant sample of the data set looks roughly like any other --- then we could build our histograms and throw away those "words" (and their counts) which are significantly rarer than the others. If the data will only be presented as a stream, and we are given no hard guarantee about the distribution of terms, then this is not a feasible approach. But if we have the data in some random-access filesystem then we could sample the data set sparsely and randomly to build a very likely set of the top K * M words (where M is some arbitrary multiple of our desired K, chosen so that the whole candidate set fits in RAM).
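A hedged sketch of that sampling idea, assuming we can cheaply skip over records (the `sample_fraction` parameter and the line-oriented record format are assumptions for the example, not part of the original question):

```python
import random
from collections import Counter

def candidate_words(records, k, m=10, sample_fraction=0.01, seed=0):
    """Build a candidate set of roughly K*M words from a sparse random sample.
    The bet is that the true top K words very likely land in this candidate
    set, which can then be counted exactly in a second, cheaper pass."""
    rng = random.Random(seed)
    sample_counts = Counter()
    for record in records:
        if rng.random() < sample_fraction:   # keep only ~1% of records
            sample_counts.update(record.split())
    return {word for word, _ in sample_counts.most_common(k * m)}
```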
Hashing might help us locate the counter for each word, but we have to consider the chance of collisions if we try to keep counts of only the hashes without keeping the "words" themselves in the data structure. In general I'd think a heap would be better (possibly dropping entries from the bottom of the in-memory heap out into a storage heap or trie).
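For the heap side of that, a bounded min-heap of (count, word) pairs is the usual in-memory shape: the rarest of the current K candidates sits at the root and is displaced whenever a more frequent word shows up. A sketch which keeps the words themselves (not just their hashes), so collisions are not a concern:

```python
import heapq

def top_k_from_counts(count_items, k):
    """Select the K most frequent words from an iterable of (word, count)
    pairs using O(K) extra space and O(V log K) time over V distinct words."""
    heap = []   # min-heap of (count, word); the smallest count is at heap[0]
    for word, count in count_items:
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)   # largest count first

counts = {"the": 4, "fox": 2, "dog": 2, "quick": 1, "lazy": 1}
print(top_k_from_counts(counts.items(), 2))   # [(4, 'the'), (2, 'fox')]
```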
I said "adaptive" earlier because one could use caching (ultimately statistical modeling) to keep the currently most frequent "words" in RAM and shuffle the least frequent out to storage (to protect against some degenerate data set where the initially frequent "words" give way to some initially rare word which becomes more frequent as one digs deeper through the data set).
While a conversational exploration of these considerations might work well in some interviews, I'd suggest being familiar with the various sections of the Wikipedia article I've cited so that you can sketch out the pseudo-code for at least one or two of the approaches and show that you do have some academic background in this material.
Absolutely do not neglect to discuss distributed processing in an interview where this "Top K" class of question is being posed. Do so even if only to clarify the constraints of the question and to acknowledge that such problems have been the driving force behind modern "big data" distributed processing systems.
Also here's a question on the same general topic: StackOverflow: The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence.