I have the following requirement to implement which poses a "puzzle" to me:
I have web server and various users (authenticated and logged-in) visit various areas of the web-site (i.e. follow and browse various links). These actions (or call it browsing) is being logged into log files.
So these files capture the date a user visited the server and the various links, i.e. URLs, it accessed.
A simplified format of the records (for explanation purposes) can be as follows:
Timestamp User-Name URL-1
So to give a simplified example of the logs we could have (assume valid dates for this):
Date-1 John URL-1
Date-1 Nick URL-1
Date-1 John URL-2
Date-1 George URL-1
Date-1 George URL-2
Date-1 Eve URL-2
Date-1 Nick URL-2
Date-1 John URL-3
Date-1 George URL-3
Date-1 John URL-5
Date-1 Nick URL-3
Date-1 Bill URL-2
Date-1 George URL-5
Date-1 Nick URL-5
Date-1 Eve URL-3
Date-1 Eve URL-5
etc and there can be hundrends/thousands of entries
When I say URL-1
I mean a valid URL for the site and so URL-1
in John and Eve really means they both visited the same link. In this example URL-2,URL-3,URL-5
is the maximum common accessed URLs sequence.
Problem: I am interested in using this information and find the most frequent accessed sequence of URLs accessed by all users both in the whole date-time range covered by log files and/or a specific date-time.
I have some first thoughts on how to go for this. E.g. my first thought was to store everything in HashMaps
and include counters for each appearance and then loop over the map entries to find max but it seems to me that it has huge overhead both in space and runtime.
Also the more I think about this, the more it seems that it might have a "standard" solution like for example for string pattern matching one would follow KMP algorithm
.
I then thought if I could use e.g. suffix trees but I only know to implement a trie and the space complexity for this would be I believe O(N^2)
. I know that there are compressed versions but I think they are too complex and I wouldn't want to lose time in case there is a better/standard solution to this problem.
Any suggestions/input is highly appreciated.