6

One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?

We have a fairly large log file, about 5GB. Each line of the log file contains a URL that a user has visited on our site. We want to figure out the 100 most popular URLs visited by our users. How do we do it?

Animesh Porwal

3 Answers

7

If we have enough RAM (say, more than 10GB), just do it straightforwardly with a hashmap: count every URL and take the top entries.

Otherwise, split the file into several smaller files using a hash function on the URL, so that all occurrences of a given URL land in the same file. Then process each file and compute its top 5. From the per-file top-5 lists it is easy to get the overall top 5; a sketch follows below.
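
A minimal sketch of this hash-partition idea (not code from the answer itself): the class name TopUrlsByPartition, the bucket count of 64, and the bucket_*.tmp file names are all illustrative assumptions, and each bucket is assumed small enough to be counted in memory.

// Sketch: hash-partition the log, count each bucket, merge the per-bucket tops
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TopUrlsByPartition
{
    const int BucketCount = 64; // assumed small enough that each bucket fits in RAM

    public static List<KeyValuePair<string, long>> Top(string logPath, int n)
    {
        // Phase 1: scatter lines into bucket files by hash of the URL, so every
        // occurrence of a given URL ends up in exactly one bucket file.
        // (Any hash that is stable within this run works.)
        var writers = new StreamWriter[BucketCount];
        for (int i = 0; i < BucketCount; i++)
        {
            writers[i] = new StreamWriter(string.Format("bucket_{0}.tmp", i));
        }
        foreach (var url in File.ReadLines(logPath))
        {
            int bucket = (url.GetHashCode() & int.MaxValue) % BucketCount;
            writers[bucket].WriteLine(url);
        }
        foreach (var w in writers) { w.Dispose(); }

        // Phase 2: count each bucket in memory and keep only its top n.
        var candidates = new List<KeyValuePair<string, long>>();
        for (int i = 0; i < BucketCount; i++)
        {
            var counts = new Dictionary<string, long>();
            foreach (var url in File.ReadLines(string.Format("bucket_{0}.tmp", i)))
            {
                long c;
                counts[url] = counts.TryGetValue(url, out c) ? c + 1 : 1;
            }
            candidates.AddRange(counts.OrderByDescending(kv => kv.Value).Take(n));
        }

        // Phase 3: the overall top n must appear in some bucket's top n,
        // because each URL's full count lives in exactly one bucket.
        return candidates.OrderByDescending(kv => kv.Value).Take(n).ToList();
    }
}

A call such as TopUrlsByPartition.Top("access.log", 100) (file name illustrative) would then return the overall top 100.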

Another solution is to sort the file with any external sorting method and then scan it once, counting occurrences; identical URLs are adjacent, so you only need to keep the count of the current URL and a running top-5 list. Anything that does not make it into the top 5 can be safely thrown away (sketched below).
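
A minimal sketch of this sort-then-scan idea (again, not code from the answer itself): it assumes the external sort has already produced a file, passed in as sortedPath, in which identical URLs sit on consecutive lines; only the current run and a small top-n list are kept in memory.

// Sketch: single scan over an externally sorted log, keeping a running top n
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TopUrlsFromSortedLog
{
    public static List<KeyValuePair<string, long>> Top(string sortedPath, int n)
    {
        var top = new List<KeyValuePair<string, long>>(); // never more than n + 1 entries
        string current = null;
        long count = 0;

        foreach (var url in File.ReadLines(sortedPath))
        {
            if (url == current)
            {
                count++; // still inside the run of the current URL
            }
            else
            {
                AddCandidate(top, current, count, n); // previous run ended
                current = url;
                count = 1;
            }
        }
        AddCandidate(top, current, count, n); // flush the final run

        return top.OrderByDescending(kv => kv.Value).ToList();
    }

    // Keep only the best n candidates seen so far; everything else is discarded.
    static void AddCandidate(List<KeyValuePair<string, long>> top,
        string url, long count, int n)
    {
        if (url == null) return; // first call, no run yet
        top.Add(new KeyValuePair<string, long>(url, count));
        if (top.Count > n)
        {
            top.Sort((a, b) => b.Value.CompareTo(a.Value)); // descending by count
            top.RemoveAt(top.Count - 1);                    // drop the smallest
        }
    }
}

Because the input is sorted, each URL's total count is known the moment its run ends, so discarding candidates that fall outside the current top n never loses the true answer.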

Haozhun
  • Your second alternative is the way to go. The divide-and-conquer approach is easy to implement, and you get linear time (if you use hashmaps for counting in the smaller files). Sorting the file is really slow in comparison. – Emil Vikström May 24 '12 at 08:55
  • @EmilVikström, I agree with you. I wrote the third one because it was the first that came to mind. But in an interview it is often the case that the interviewer will construct an environment that makes your previous solution invalid, to see if you can come up with anything else. – Haozhun May 24 '12 at 08:58
  • 2
    "You can safely throw anything that doesn't make into top5 away." -> No that's wrong. – Thomash May 24 '12 at 09:50
  • @Thomash Can you give an example? Note that it's a sorted list of strings. – Haozhun May 24 '12 at 10:00
  • I'll give you an example in my answer as I don't have enough room here. – Thomash May 24 '12 at 10:04
  • @aix Not if the items are partitioned by hash, as Gene suggests, because then all instances of a given URL are in one file only. – Nick Johnson May 25 '12 at 06:00
2

Just sort the log file by URL (this needs only constant extra space if you choose an in-place algorithm such as heap sort) and then count, for each URL, how many times it appears (easy, since lines with the same URL are next to each other).

Overall complexity is O(n log n).

Why splitting into many files and keeping only the top 3 (or top 5, or top N) of each file is wrong:

     File1 File2 File3
url1   5     0     5
url2   0     5     5
url3   5     5     0
url4   5     0     0
url5   0     5     0
url6   0     0     5
url7   4     4     4

url7 never makes it to the top 3 in the individual files but is the best overall.

Thomash
  • Constant space is still too much for a 5GB file. – Haozhun May 24 '12 at 09:31
  • No, 5GB can be stored in main memory without a problem. But if you consider that you have only N GB of RAM, you can split the log file into N GB blocks, apply my solution to each block, and then aggregate the results. – Thomash May 24 '12 at 09:35
  • 1
    Can you explain more about your split and aggregation method? – Haozhun May 24 '12 at 09:42
  • Split: very simple, the first N bytes (not exactly N because you don't want to cut a line) go to the first file and you repeat the operation on the rest of the file. – Thomash May 24 '12 at 09:47
  • Aggregate: file1 url1 visited a times, file2 url1 visited b times -> aggregate url1 visited a+b times. – Thomash May 24 '12 at 09:49
  • Your split and aggregation can still require more than 5GB of memory in the worst case: only 5 URLs appear twice, all others appear once. – Haozhun May 24 '12 at 10:00
1

Because the log file is fairly large, you should read it with a stream reader rather than loading it all into memory. It is reasonable to expect that the set of distinct links fits in memory while we work through the log file.

// Pseudo
Hashmap<url, count> map

while (log file has next line) {
    url = next line of log file
    add url to map and update its count
}

List list
for each (url, count) entry in map {
    add entry to list
}

sort the list by count, descending
take the top n entries from the list

The running time is O(n) + O(m log m), where n is the number of lines in the log file and m is the number of distinct links found.

Here's a C# implementation of the pseudo-code. An actual file reader and log file are not provided; instead, reading a log file is emulated with a simple in-memory list.

The algorithm uses a hashmap to store the links it finds. A sort then yields the top 100 links. A simple data-container class is used for the sorting step.

The memory requirement depends on the number of distinct links: the hashmap must be able to hold all the distinct links found, otherwise this algorithm won't work.

// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit"); 
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }
        // Sorts in descending order of Count (a null comparand counts as 0)
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;
        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000)
                 .ToString("x")));
        }

        // Hashmap memory must be allocated 
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            // Count this url; TryGetValue avoids a second dictionary lookup
            int count;
            if (lookup.TryGetValue(url, out count))
            {
                lookup[url] = count + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData 
             { Url = i.Key, Count = i.Value }).ToList();
        // O(mlogm)
        list.Sort();
        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine("Time elapsed msec: {0}",
            stopwatch.ElapsedMilliseconds);
    }
}

Edit: This answer has been updated based on the comments

  • added: running time and memory complexity analysis
  • added: pseudo-code
  • added: explanation of how we handle a fairly large log file
Kunukn
  • 3
    This answer may show us that you know C# but no one cares about it. The tags of the question are `algorithm` and `data-structures`, not `C#`. – Thomash May 24 '12 at 09:54
  • 1
    Not meant as to demonstrate I know C#, but the intend was to 'show' the algorithm through code. – Kunukn May 24 '12 at 10:07
  • I believe what @Thomash wants to point out is that you should make your answer concise. You would prefer reading a short answer to a long block of code, wouldn't you? By the way, the key phrase in the OP's question is "fairly large", but you clearly ignored it. – Haozhun May 24 '12 at 12:01
  • I have updated my answer. I hope it is clearer now; if not, let me know. – Kunukn May 24 '12 at 21:46
  • Not the best answer, since sorting all pages is unnecessary; the question doesn't even require the top 100 to be in sorted order. The fastest way to find the k largest (or smallest) elements in a list is to use the binary partition step from qsort and stop when the pivot lands at position k (i.e. quickselect). The average complexity is O(m). See also: http://stackoverflow.com/questions/251781/how-to-find-the-kth-largest-element-in-an-unsorted-array-of-length-n-in-on – cmcginty Sep 26 '16 at 07:54
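
For illustration, here is a minimal sketch of the quickselect idea from the comment above; it is not part of any answer or library. The class name TopK and its methods are purely illustrative, and the element type matches the Dictionary<string, int> used in the answer's code. It finds the k most-counted entries in O(m) average time without fully sorting the list.

// Sketch: top-k selection via quickselect (illustrative, not from the answers)
using System;
using System.Collections.Generic;

public static class TopK
{
    static readonly Random Rng = new Random();

    // Rearranges items so that the first k entries are the k largest by count
    // (in no particular order) and returns them.
    public static List<KeyValuePair<string, int>> Select(
        List<KeyValuePair<string, int>> items, int k)
    {
        if (k >= items.Count) return new List<KeyValuePair<string, int>>(items);

        int lo = 0, hi = items.Count - 1;
        while (lo < hi)
        {
            int p = Partition(items, lo, hi);
            if (p == k) break;       // first k slots now hold the k largest
            if (p < k) lo = p + 1;   // boundary is further right
            else hi = p - 1;         // boundary is further left
        }
        return items.GetRange(0, k);
    }

    // Lomuto partition, descending by count; returns the pivot's final index.
    static int Partition(List<KeyValuePair<string, int>> a, int lo, int hi)
    {
        Swap(a, hi, Rng.Next(lo, hi + 1)); // random pivot avoids bad cases
        int pivotCount = a[hi].Value;
        int store = lo;
        for (int i = lo; i < hi; i++)
        {
            if (a[i].Value > pivotCount)
            {
                Swap(a, i, store);
                store++;
            }
        }
        Swap(a, store, hi);
        return store;
    }

    static void Swap(List<KeyValuePair<string, int>> a, int i, int j)
    {
        KeyValuePair<string, int> tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

In the answer's code, list.Sort() followed by Take(100) could then be replaced by something like TopK.Select(lookup.ToList(), 100); note that the returned 100 entries are not themselves sorted.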