Because the log file is fairly large, you should read it with a stream reader rather than loading the whole file into memory at once. It is reasonable to assume that the number of distinct links is small enough to keep in memory while we work through the log file.
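As a minimal sketch of that reading pattern (the file path and line handling here are hypothetical, not part of the original answer):

// Sketch only: reads one line at a time, so memory use is
// independent of the file size. "access.log" is a made-up path.
using (var reader = new System.IO.StreamReader("access.log"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // parse the url out of the line and update its count here
    }
}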
// Pseudo
Hashmap map<url, count>
while (logfile has nextline)
{
    url = nextline in logfile
    add url to map and update count
}
List list
foreach (m in map)
{
    add m to list
}
sort the list by count value
take top n from the list
The running time is O(n) + O(m log m), where n is the number of lines in the log file and m is the number of distinct links found.
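For instance, with the emulated data below (n = 500,000 lines drawn from roughly 1,000 distinct URLs), that is 500,000 counting steps plus on the order of 1,000 · log2(1,000) ≈ 10,000 comparison steps for the sort.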
Here's a C# implementation of the pseudo-code. An actual file reader and log file are not provided; a simple emulation that reads a list held in memory is used instead.
The algorithm uses a hashmap to store the found links and their counts, and a sort then yields the top 100 links. A small data-container class is used for the sorting step.
The memory complexity is O(m), driven by the expected number of distinct links: the hashmap must be able to hold every distinct link found, otherwise this approach won't work.
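As a rough, assumption-laden estimate: one million distinct URLs of ~60 characters each would put the dictionary somewhere in the low hundreds of megabytes (string data plus per-entry object and bucket overhead), which fits comfortably in memory on typical hardware.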
// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit");
        Console.ReadKey();
    }

    class LinkData : IComparable<LinkData>
    {
        public string Url { get; set; }
        public int Count { get; set; }

        // Descending by count: larger counts sort first,
        // and any instance ranks before null.
        public int CompareTo(LinkData other)
        {
            return other == null ? -1 : other.Count.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;

        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com",
                rand.Next(1000).ToString("x")));
        }

        // Hashmap memory must be allocated
        // to contain the distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            int count;
            lookup[url] = lookup.TryGetValue(url, out count) ? count + 1 : 1;
        }

        // O(m) where m is number of distinct urls
        var list = lookup
            .Select(i => new LinkData { Url = i.Key, Count = i.Value })
            .ToList();

        // O(m*log(m))
        list.Sort();

        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1) -- at most 100 items
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }
        Console.WriteLine("Time elapsed msec: {0}",
            stopwatch.ElapsedMilliseconds);
    }
}
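If m grows very large, the full O(m log m) sort can be avoided. The following is a sketch (not part of the original answer) of a bounded min-heap selection that finds the top k in O(m log k); it assumes .NET 6+, where PriorityQueue<TElement, TPriority> is available:

// Sketch only: keeps at most k candidates in a min-heap keyed on
// count, so the heap root is always the smallest of the current top-k.
static List<LinkData> TopK(Dictionary<string, int> lookup, int k)
{
    var heap = new PriorityQueue<LinkData, int>();
    foreach (var pair in lookup)
    {
        heap.Enqueue(new LinkData { Url = pair.Key, Count = pair.Value },
            pair.Value);
        if (heap.Count > k)
            heap.Dequeue(); // evict the entry with the smallest count
    }
    var result = new List<LinkData>(heap.Count);
    while (heap.Count > 0)
        result.Add(heap.Dequeue()); // ascending by count
    result.Reverse(); // largest count first
    return result;
}

For the ~1,000 distinct URLs in the emulation above the difference is negligible, but it matters when m reaches into the millions.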
Edit: This answer has been updated based on the comments
- added: running time and memory complexity analysis
- added: pseudo-code
- added: explanation of how to handle a fairly large log file