
I have the following requirement to implement, which poses a "puzzle" to me:
I have a web server, and various users (authenticated and logged in) visit various areas of the web site (i.e. follow and browse various links). These actions (call it browsing) are logged into log files.
So these files capture the date a user visited the server and the various links, i.e. URLs, they accessed.
A simplified format of the records (for explanation purposes) can be as follows:
Timestamp User-Name URL
So to give a simplified example of the logs we could have (assume valid dates for this):

Date-1 John    URL-1  
Date-1 Nick    URL-1  
Date-1 John    URL-2  
Date-1 George  URL-1  
Date-1 George  URL-2
Date-1 Eve     URL-2  
Date-1 Nick    URL-2  
Date-1 John    URL-3
Date-1 George  URL-3  
Date-1 John    URL-5  
Date-1 Nick    URL-3  
Date-1 Bill    URL-2  
Date-1 George  URL-5
Date-1 Nick    URL-5      
Date-1 Eve     URL-3                
Date-1 Eve     URL-5   

etc., and there can be hundreds/thousands of entries.
When I say URL-1 I mean a valid URL for the site, so URL-1 for both John and Nick really means they both visited the same link. In this example, URL-2, URL-3, URL-5 is the maximum common accessed URL sequence.

Problem: I am interested in using this information to find the most frequently accessed sequence of URLs across all users, both over the whole date-time range covered by the log files and/or within a specific date-time range.
I have some first thoughts on how to approach this. E.g. my first thought was to store everything in HashMaps, keep counters for each appearance, and then loop over the map entries to find the max, but it seems to me that this has huge overhead in both space and runtime.
Also, the more I think about this, the more it seems that it might have a "standard" solution, as for example string pattern matching has the KMP algorithm.
I then thought of using e.g. suffix trees, but I only know how to implement a trie, and the space complexity for that would be, I believe, O(N^2). I know that there are compressed versions, but I think they are too complex, and I wouldn't want to lose time on them in case there is a better/standard solution to this problem.
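To make the brute-force idea concrete, here is a rough sketch of the HashMap approach I had in mind (the class and method names, and the in-memory format, are my own assumptions; it counts every contiguous per-user subsequence, which is exactly the O(N^2) overhead I am worried about):

```java
import java.util.*;

public class FrequentSequence {
    // Count every contiguous URL subsequence (length >= 2) across all users
    // and return the most frequent one; ties are broken by preferring the
    // longer sequence. Brute force: O(N^2) subsequences per user.
    static List<String> mostFrequentSequence(Map<String, List<String>> visitsByUser) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> visits : visitsByUser.values()) {
            for (int i = 0; i < visits.size(); i++) {
                for (int j = i + 2; j <= visits.size(); j++) {
                    List<String> seq = new ArrayList<>(visits.subList(i, j));
                    counts.merge(seq, 1, Integer::sum);
                }
            }
        }
        return counts.entrySet().stream()
                .max(Comparator.comparingInt((Map.Entry<List<String>, Integer> e) -> e.getValue())
                        .thenComparingInt(e -> e.getKey().size()))
                .map(Map.Entry::getKey)
                .orElse(Collections.emptyList());
    }

    public static void main(String[] args) {
        // Per-user visit order taken from the example log above (Date-1 only).
        Map<String, List<String>> visits = new LinkedHashMap<>();
        visits.put("John",   List.of("URL-1", "URL-2", "URL-3", "URL-5"));
        visits.put("Nick",   List.of("URL-1", "URL-2", "URL-3", "URL-5"));
        visits.put("George", List.of("URL-1", "URL-2", "URL-3", "URL-5"));
        visits.put("Eve",    List.of("URL-2", "URL-3", "URL-5"));
        visits.put("Bill",   List.of("URL-2"));
        System.out.println(mostFrequentSequence(visits)); // prints [URL-2, URL-3, URL-5]
    }
}
```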

Any suggestions/input is highly appreciated.

Cratylus
  • Please clarify: are you speaking about **sequences of URLs**, or about separate **URLs**? – Andremoniy Jan 11 '13 at 18:33
  • @Andremoniy: I don't understand your question. I mean `URL-2, URL-3, URL-5`. This is the visited order – Cratylus Jan 11 '13 at 18:35
  • I think you should consider a database to store the hits on your site. Each time you restart your application you will have to reparse all the log files, which will be a massive overhead. When it is in a db you can just query what you need. – tom Jan 11 '13 at 18:35
  • @Cratylus, ok, I just wanted to clarify, because you wrote *is the maximum common accessed URL*; maybe it should be *...maximum common accessed URL **sequence***? Shouldn't it? – Andremoniy Jan 11 '13 at 18:37
  • @Andremoniy: Ok, updated OP – Cratylus Jan 11 '13 at 18:43
  • @tom: This information is not part of the database. It is part of the logs, and perhaps in the future I can "upgrade" my schema for this purpose, but for now this is the only place I can get this information – Cratylus Jan 11 '13 at 18:44
  • @Cratylus: if that is the case, make sure that your log parsing application saves results in some sort of file (or even a db), because you really do not want to reparse everything after an upgrade, a crash or something like that. Applications can run for years and create gigabytes of logs. – tom Jan 11 '13 at 18:52
  • Just for the log analysis, could you load the files into a database and perform the analysis via SQL? Considering the potential size of these logs, processing them via Java may not be very performant. That said, I am very curious to see what answers come up. – Nivas Jan 11 '13 at 18:54
  • @Nivas: Well, in the worst case we could change the logging file rolling configuration to have files of a more "manageable" size – Cratylus Jan 11 '13 at 18:57
  • You are talking about a rolling file configuration. If you are using Log4J or something like that, it most likely has a database appender, which would save your life. – tom Jan 11 '13 at 18:59
  • @tom: Cannot update the schema for the moment. Will follow your advice as soon as I can – Cratylus Jan 11 '13 at 19:04

2 Answers


Well, you said that *any suggestions/input is highly appreciated*. So let me briefly suggest the following algorithm:

  1. Filter the log file for the needed date range, collecting the URL sequence for each user in parallel, in some List.

  2. After step 1 you have a set of big sequences. At this point the issue is equivalent to the task of finding the most common substring in a list of strings. This is an already solved problem.

UPD: After that, consider each URL like a "char" in some "string".
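A minimal sketch of this idea, assuming step 1 has already produced the per-user visit lists (the char encoding and the textbook longest-common-substring DP shown here are illustrative choices, not the only ones):

```java
import java.util.*;

public class UrlAsChar {
    // Map each distinct URL to a char so per-user visit sequences become
    // plain strings; then standard string algorithms (DP, suffix trees) apply.
    // Fine while the distinct-URL count is small; real code would use int IDs.
    static String encode(List<String> urls, Map<String, Character> dict) {
        StringBuilder sb = new StringBuilder();
        for (String u : urls) {
            if (!dict.containsKey(u)) dict.put(u, (char) ('a' + dict.size()));
            sb.append(dict.get(u));
        }
        return sb.toString();
    }

    // Classic O(n*m) DP for the longest common substring of two encoded
    // sequences: dp[i][j] = length of the common suffix ending at a[i-1], b[j-1].
    static String longestCommonSubstring(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        int best = 0, endA = 0;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > best) { best = dp[i][j]; endA = i; }
                }
        return a.substring(endA - best, endA);
    }
}
```

The same dictionary is shared across all users, so equal URLs always encode to equal chars and the decoded result can be mapped back to URLs.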

Andremoniy
  • That is what I originally thought; that is why I mentioned suffix trees. Are you suggesting DP? – Cratylus Jan 11 '13 at 18:59
  • Yes, exactly DP. In this case each URL will be like a `char` in a `string`. – Andremoniy Jan 11 '13 at 18:59
  • The space requirement is still `O(N^2)`, right? Why would this be better than a suffix trie? – Cratylus Jan 11 '13 at 19:03
  • It's just an idea. I just described my abstract view of this problem. I don't think it would be considerably better than a suffix tree. Still, it could be a decent solution. But you wrote that you do not want to *lose time* on such things. So I suggest you use already-implemented algorithms; the only thing is walking up to another level of abstraction: `URL` instead of `char`. – Andremoniy Jan 11 '13 at 19:11
  • I know that suffix trees (compressed) are hard to understand and implement. I only know suffix tries. If the best solution is a suffix tree, I would do it. But I wouldn't want to invest time in this if there is a better alternative – Cratylus Jan 11 '13 at 19:43

I'm sorry, but I do not think it is possible to achieve this with the data you have in your log files.

The problem I see is that you are looking for the most used sequence of URLs. In your question you only have the userId and not a session indicator, which means that you cannot reliably find out what a user did during a single session. You might be mixing different sessions when trying to reconstruct the path they were taking.

Suppose you had a sessionId; then you could create a path for each session and run some (still unknown) program on it to find the most used 'arcs'.
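To illustrate, grouping into per-session paths could look like the following sketch (the sessionId field is hypothetical; it does not exist in the OP's current log format):

```java
import java.util.*;

public class SessionPaths {
    // Hypothetical extended log record "Timestamp User Session URL";
    // the sessionId column is NOT present in the logs described above.
    record Entry(String timestamp, String user, String sessionId, String url) {}

    // Group entries (assumed already in time order, as log lines are)
    // into one ordered URL path per session.
    static Map<String, List<String>> pathsBySession(List<Entry> log) {
        Map<String, List<String>> paths = new LinkedHashMap<>();
        for (Entry e : log)
            paths.computeIfAbsent(e.sessionId(), k -> new ArrayList<>()).add(e.url());
        return paths;
    }
}
```

Each resulting path is then an unambiguous unit of browsing, instead of a mix of possibly-interleaved visits.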

tom
  • It has been a long time since high school but I'm thinking in the direction of Dijkstra graphs or some derivatives. – tom Jan 11 '13 at 19:25
  • But why would I care about the session? I only need to know whether `URL-1` has been visited or not. What does the session have to do with this? – Cratylus Jan 11 '13 at 19:42
  • @Tom, lines in the log file are ordered in time order. It means that if user `USER-1` appears in the log file with `URL-1`, and then some lines later `USER-1` appears with `URL-3`, then `USER-1` visited `URL-1` and after that `URL-3`, so `URL-1, URL-3` is his sequence. – Andremoniy Jan 11 '13 at 19:47
  • @Andremoniy but what if he did the same thing twice in a day? Would you consider `URL-1 URL-3 URL-1 URL-3` a sequence in that case? – tom Jan 12 '13 at 11:25
  • @Tom yes, but this doesn't matter at all, because the task is to find the most frequent sequence among all of them, including **subsequences**. – Andremoniy Jan 12 '13 at 11:36