
Ok,

I have been getting this question in a ton of interviews and I think I need some help solving it.

You have a ton of URLs, say as a String array or read from a file. You now need to get the top ten most read ones, i.e. the ten most frequent URLs in the file.

My approach was:

        Read them into a String array.
        Iterate through each String/URL:
            At every iteration, put them into a Hashtable, incrementing the count.
        Iterate again and feed the scores into an array.
        Sort and find the top 10 scores OR use a max-heap to get the top 10.
        Iterate again and pick out the URLs with the top 10 scores.

Is this a very bad answer? Can someone help me further with this?
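
In code, the approach I have in mind would look roughly like the sketch below (just a sketch, assuming all the URLs fit in memory; `topTenUrls` is simply a name I picked):

    import java.util.*;

    public class TopUrls {
        // Count every URL, then keep the 10 largest counts in a small heap
        // so the whole table never has to be sorted.
        public static List<String> topTenUrls(List<String> urls) {
            Map<String, Long> counts = new HashMap<>();
            for (String url : urls) {
                counts.merge(url, 1L, Long::sum);          // increment, starting at 1
            }

            // Min-heap ordered by count: the head is always the smallest of
            // the current top 10, so it is the one to evict.
            PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));

            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                heap.offer(entry);
                if (heap.size() > 10) {
                    heap.poll();                           // drop the smallest
                }
            }

            List<String> top = new ArrayList<>();
            while (!heap.isEmpty()) {
                top.add(heap.poll().getKey());             // ascending by count
            }
            Collections.reverse(top);                      // most frequent first
            return top;
        }
    }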

gran_profaci
  • Related question, although it's across a network - [Find Top 10 Most Frequent visited URL, data is stored across network](http://stackoverflow.com/questions/17928158/find-top-10-most-frequent-visited-url-data-is-stored-accross-network). – Bernhard Barker Sep 21 '13 at 19:21
  • Another related question - [Getting top 100 URL from a log file](http://stackoverflow.com/questions/10733983/getting-top-100-url-from-a-log-file). – Bernhard Barker Sep 21 '13 at 19:24
  • Only if Unix `sort file.txt | uniq -c | sort -rn | head` – Vbp Apr 07 '14 at 18:46

3 Answers

3

You can do this with minimal memory and with a file of essentially unlimited size:

Use the OS-supplied sort utility to sort the URLs on disk
Create a priority queue (binary heap, for example)
previous_url = none, count = 0
For each URL in the (now sorted) file
    if the URL is the same as previous_url
        increase count
    else
        if previous_url is not none
            AddToQueue(previous_url, count)
        previous_url = current_url
        count = 1
EndFor
AddToQueue(previous_url, count)

At this point, the top 10 most visited URLs will be in the priority queue.

The AddToQueue function looks like this:

AddToQueue(url, count)
    if (queue.Count < 10)
        add new url with count to queue
    else if (count > top_of_queue.count)
        remove top URL from queue
        add new url with count to queue

If you have enough memory to load all the URLs, you can load them into an array and sort that in memory instead. But if you have enough memory for all the URLs, then the dictionary-based approach is probably faster.
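
In Java, that loop might look roughly like the following sketch (it assumes the file `sorted-urls.txt` has already been sorted on disk, e.g. with the OS sort utility; the file name and the `addToQueue` helper are just illustrative):

    import java.io.*;
    import java.util.*;

    public class TopFromSortedFile {
        public static void main(String[] args) throws IOException {
            // Min-heap on count: the head is always the smallest of the
            // current top 10 and is the first candidate for eviction.
            PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));

            try (BufferedReader in = new BufferedReader(new FileReader("sorted-urls.txt"))) {
                String previous = null;
                long count = 0;
                String url;
                while ((url = in.readLine()) != null) {
                    if (url.equals(previous)) {
                        count++;                       // the run of equal URLs continues
                    } else {
                        addToQueue(heap, previous, count);
                        previous = url;
                        count = 1;
                    }
                }
                addToQueue(heap, previous, count);     // flush the final run
            }

            // Drains the least frequent of the top 10 first.
            while (!heap.isEmpty()) {
                Map.Entry<String, Long> e = heap.poll();
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }

        // Keep only the 10 largest (url, count) pairs seen so far.
        private static void addToQueue(PriorityQueue<Map.Entry<String, Long>> heap,
                                       String url, long count) {
            if (url == null) return;                   // nothing before the first URL
            if (heap.size() < 10) {
                heap.offer(new AbstractMap.SimpleEntry<String, Long>(url, count));
            } else if (count > heap.peek().getValue()) {
                heap.poll();
                heap.offer(new AbstractMap.SimpleEntry<String, Long>(url, count));
            }
        }
    }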

Jim Mischel
0

The runtime isn't bad: something like O(n log n) overall.

Read them into a String array - O(n)
Iterate through each String/URL - O(n)
    At every iteration, put them into a Hashtable, incrementing the count - O(1)
Iterate again and feed the scores into an array - O(n)
Sort and find the top 10 scores OR use a max-heap to get the top 10 - O(n log n)
Iterate again and pick out the URLs with the top 10 scores - O(1)

However, you could probably skip the last two steps. Instead, iterate through the entries of the hashtable (or the URLs), and while you are going through the entries maintain an array with 10 slots for the top 10 scores; that way you skip the sorting step, and the overall runtime will be O(n).
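
A sketch of that single pass in Java could look like the following (the `topTen` name and the fixed 10-slot arrays are just one way to do it; it assumes the counts are already in a map):

    import java.util.*;

    public class TopTenNoSort {
        // Walk the count table once and keep a fixed array of the 10 best
        // entries seen so far. Each insertion into the 10-slot array costs
        // at most O(10), so the whole pass is O(n).
        public static List<String> topTen(Map<String, Long> counts) {
            String[] topUrl = new String[10];
            long[] topCount = new long[10];            // kept in descending order

            for (Map.Entry<String, Long> e : counts.entrySet()) {
                long c = e.getValue();
                if (c <= topCount[9]) continue;        // not better than 10th place
                // Shift smaller entries down, dropping the current 10th place.
                int i = 9;
                while (i > 0 && topCount[i - 1] < c) {
                    topCount[i] = topCount[i - 1];
                    topUrl[i] = topUrl[i - 1];
                    i--;
                }
                topCount[i] = c;
                topUrl[i] = e.getKey();
            }

            List<String> result = new ArrayList<>();
            for (String u : topUrl) {
                if (u != null) result.add(u);          // fewer than 10 distinct URLs
            }
            return result;
        }
    }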

Ryan
0

First of all, note that you're using extra memory unnecessarily. Why read everything into an array and then iterate through that array to insert everything in a hash table? Unless you have a very strong reason to do so, you should just put it into the hashtable as you read it.

That cuts the need for an array, and drops memory usage by O(n). The other steps sound pretty reasonable. The idea of maintaining an array with 10 entries for the top 10 scores is a nice approach, but it poses the question of how to do so efficiently.

Also, using a hash table can raise implementation issues: you are relying heavily on something that is typically provided by built-in libraries. For the sake of the interview, maybe a better approach would be to read everything into a binary search tree where each node holds a structure with the string, the number of occurrences of that string, and pointers to the left and right nodes. A binary search tree lets you check whether a string is already there in O(log n) time (assuming it stays reasonably balanced). After reading everything and placing it in the tree, you can sort the nodes by occurrence count with shell sort. Shell sort is a good option for this exercise because it tends to eliminate large amounts of disorder quickly. This solution runs in O(n log n) time.
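
A rough sketch of such a node in Java might look like this (names like `UrlNode`, `insert`, and `collect` are mine, and the tree here is not self-balancing, so the O(log n) lookup only holds if the input does not degenerate it):

    import java.util.List;

    class UrlNode {
        String url;
        long count = 1;                 // seen once when the node is created
        UrlNode left, right;

        UrlNode(String url) {
            this.url = url;
        }

        // Insert a URL: bump the count if it is already in the tree,
        // otherwise add a new leaf in lexicographic position.
        static UrlNode insert(UrlNode root, String url) {
            if (root == null) return new UrlNode(url);
            int cmp = url.compareTo(root.url);
            if (cmp == 0) {
                root.count++;
            } else if (cmp < 0) {
                root.left = insert(root.left, url);
            } else {
                root.right = insert(root.right, url);
            }
            return root;
        }

        // Collect every node so the caller can sort them by count afterwards
        // (e.g. copy the node references into an array, as the comments
        // below discuss).
        static void collect(UrlNode root, List<UrlNode> out) {
            if (root == null) return;
            collect(root.left, out);
            out.add(root);
            collect(root.right, out);
        }
    }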

If your interviewer is OK with using a hash table, maybe it's not worth the trouble of implementing a tree, but by saying "I'll use a hash table" you can be shooting yourself in the foot, unless of course you implement the hash table yourself. It really depends on the context.

Filipe Gonçalves
  • It doesn't hurt to mention both the hash table and the BST and the advantages of each, especially in an interview. – Bernhard Barker Sep 21 '13 at 19:35
  • There's nothing wrong with using something that's in built-in libraries. A BST is also typically in built-in libraries. If we're talking about implementing them oneself, to me, the implementations are of similar difficulty (or simplicity). I'd actually say the BST might be more difficult, especially given that BST's are reasonably useless if not self-balancing, and you basically need to remember how to do this as very few people would be able to figure it out from scratch in an interview. – Bernhard Barker Sep 21 '13 at 19:37
  • In this case, it's lexicographically sorted, because it's used to check if a string is in there. You'd still have to sort it according to the number of times a string occurs in the end. – Filipe Gonçalves Sep 21 '13 at 19:42
  • So I assume you don't mean "sort the tree", but rather "copy the tree into an array and sort that" (or something similar), since sorting a tree could be quite a challenge. – Bernhard Barker Sep 21 '13 at 19:46
  • Or maintain a list of pointers to every tree node – Filipe Gonçalves Sep 21 '13 at 19:58