1

let me just say this is a HW project from UC Berkeley, so If you could give me a hint instead of the solution, It'd be great.

The problem is to create a data structure called "YearlyRecord" that maps a word to the number of times it is encountered in an ngram file. Here are the restrictions.

  • getting the count must be O(1)
  • getting the number of mappings must be O(1)
  • "put" must be on average O(log n)
  • getting the "rank" must be O(1)
  • counts, which provides a data structure of all of counts in increasing order
  • words, which provides a data structure of all words, in the order of counts

Note: "rank" is a function that takes a word and tells me its rank (e.g. 1, 2, 3, 4), which is based on the number of occurrences we have of it.

My solutions so far:

  • Hashmap (word -> count) provides O(1) getCount operation and O(1) size operation
  • make a private inner class that implements Comparator which compares based on the number of occurrences in the hashmap. Then use this comparator as input to a TreeSet of the words. This provides the "words" collection, which we can return in O(1) time.
  • Make another TreeSet of the counts. This provides an O(1) return of the counts in increasing order.
  • rank: Here is where I am stuck. It is clear that rank should be a map as well. I have the words in the TreeSet in increasing order of rank, but no indices to map to.
Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
  • I don't see myself retrieving anything from it. "words" and "counts" gets the whole collection, that is O(1) because I just return the whole set. As far as O(lgn) "put," that is perfectly ok, I am actually constrained by a O(lgn) "put" on YearlyRecord – Sergey Kojoian May 30 '15 at 16:32
  • OK, I think I see what you mean. But you were not actually constrained on the performance of `counts` and `words`, were you? – RealSkeptic May 30 '15 at 16:46
  • no, the only constraints were on count, size, rank. The explicit ones at least. Making "counts" and "words" constant was my initiative. – Sergey Kojoian May 30 '15 at 17:08
  • just as a reminder, the retrieval of the count of a word is done through the hashmap – Sergey Kojoian May 30 '15 at 17:10
  • Yes, but when you return such data structures, you should return a *copy* of your internal structure, or whoever uses the reference can change it, and then you will give a bad data structure next time. – RealSkeptic May 30 '15 at 17:12
  • yea I know returning a view is a problem. I wanted to take this one step at a time, finish the rank problem, have it run as fast as I can, then go back and deal with the view. At this point, we can safely assume nobody is going to mess with the view – Sergey Kojoian May 30 '15 at 17:16
  • How do we determine the rank of a word if several words have the same number of occurrences? – kraskevich May 30 '15 at 18:35
  • We return any one of them, so say the first one in a list – Sergey Kojoian May 31 '15 at 05:30
  • possible duplicate of [Design a system to keep top k frequent words real time](http://stackoverflow.com/questions/21692624/design-a-system-to-keep-top-k-frequent-words-real-time) – rici May 31 '15 at 16:20
  • @SergeyKojoian: The answer to the linked question provides the outline of an algorithm which should help solve your problem. – rici May 31 '15 at 16:22

0 Answers0