7

Given a web log which consists of fields 'User ' 'Page url'. We have to find out the most frequent 3-page sequence that users takes.

There is a time stamp. and it is not guaranteed that the single user access will be logged sequentially it could be like user1 Page1 user2 Pagex user1 Page2 User10 Pagex user1 Page 3 her User1s page sequence is page1-> page2-> page 3

Sundararajan S
  • 1,358
  • 2
  • 14
  • 24
  • Are items guaranteed to be added in access order, or is there a timestamp as well? – jball Jun 07 '10 at 16:58
  • There is a time stamp. and it is not guaranteed that the single user access will be logged sequentially it could be like user1 Page1 user2 Pagex user1 Page2 User10 Pagex user1 Page 3 her User1s page sequence is page1-> page2-> page 3 – Sundararajan S Jun 07 '10 at 17:00
  • 1
    The easiest way would be to concatenate every 3 URLs in a row, then count how many times each concatenated version appears. This will be implementation dependent of course... where are you getting this information? Are you using a database? Are you reading from a log file? – John Rasch Jun 07 '10 at 17:02
  • How time/memory dependent is your ideal solution and how large are the number of available URLs and number of log entries? – jball Jun 07 '10 at 17:06
  • For a ruby solution, see [this post](https://ibraheem.ca/posts/most-common-n-page-path) – Ibraheem Ahmed Aug 14 '20 at 20:56

6 Answers6

5

Assuming your log is stored in timestamp order, here's an algorithm to do what you need:

  1. Create a hashtable 'user_visits' mapping user ID to the last two pages you observed them to visit
  2. Create a hashtable 'visit_count' mapping 3-tuples of pages to frequency counts
  3. For each entry (user, URL) in the log:
    1. If 'user' exists in user_visits with two entries, increment the entry in visit_count corresponding to the 3-tuple of URLs by one
    2. Append 'URL' to the relevant entry in user_visits, removing the oldest entry if necessary.
  4. Sort the visit_count hashtable by value. This is your list of most popular sequences of URLs.

Here's an implementation in Python, assuming your fields are space-separated:

fh = open('log.txt', 'r')
user_visits = {}
visit_counts = {}
for row in fh:
  user, url = row.split(' ')
  prev_visits = user_visits.get(user, ())
  if len(prev_vists) == 2:
    visit_tuple = prev_vists + (url,)
    visit_counts[visit_tuple] = visit_counts.get(visit_tuple, 0) + 1
  user_visits[user] = (prev_vists[1], url)
popular_sequences = sorted(visit_counts, key=lambda x:x[1], reverse=True)
Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • Could you or someone else explain what "user_visits.get(user, ())" is supposed to return? A few of the other lines could use comment as well :) – Organiccat Oct 07 '12 at 20:07
  • @Organiccat `.get` is a Python standard library call on the dict; it returns the value corresponding to the key in the first argument if it exists in the dict, and returns the second argument if it doesn't. `()` is a Python idiom for an empty Tuple. – Nick Johnson Oct 08 '12 at 13:15
  • The spec asks for just the most frequent 3-page sequence, not an entire sorted list of them. That means you can improve the efficiency by tracking the most frequent sequence as you are iterating through the log in step 3.1. This removes the sorting in step 4 entirely. – sojo2600 Jan 22 '20 at 21:49
3

Quick and dirty:

  • Build a list of url/timestamps per user
  • sort each list by timestamp
  • iterate over each list
    • for each 3 URL sequence, create or increment a counter
  • find the highest count in the URL sequence count list

foreach(entry in parsedLog)
{
    users[entry.user].urls.add(entry.time, entry.url)
}

foreach(user in users)
{
    user.urls.sort()
    for(i = 0; i < user.urls.length - 2; i++)
    {
        key = createKey(user.urls[i], user.urls[i+1], user.urls[i+2]
        sequenceCounts.incrementOrCreate(key);
    }
}

sequenceCounts.sortDesc()
largestCountKey = sequenceCounts[0]
topUrlSequence = parseKey(largestCountkey)
jball
  • 24,791
  • 9
  • 70
  • 92
2

Here's a bit of SQL assuming you could get your log into a table such as

CREATE TABLE log (
   ord  int,
   user VARCHAR(50) NOT NULL,
   url  VARCHAR(255) NOT NULL,
   ts   datetime
) ENGINE=InnoDB;

If the data is not sorted per user then (assuming that ord column is the number of the line from the log file)

SELECT t.url, t2.url, t3.url, count(*) c
FROM  
      log t INNER JOIN
      log t2 ON t.user = t2.user INNER JOIN
      log t3 ON t2.user = t3.user
WHERE 
      t2.ord IN (SELECT MIN(ord) 
                 FROM log i 
                 WHERE i.user = t.user AND i.ord > t.ord) 
      AND
      t3.ord IN (SELECT MIN(ord) 
                 FROM log i 
                 WHERE i.user = t.user AND i.ord > t2.ord)
GROUP BY t.user, t.url, t2.url, t3.url
ORDER BY c DESC
LIMIT 10;

This will give top ten 3 stop paths for a user. Alternatively if you can get it ordered by user and time you can join on rownumbers more easily.

Unreason
  • 12,556
  • 2
  • 34
  • 50
1

Source code in Mathematica

s= { {user},{page} }  (* load List (log) here *)

sortedListbyUser=s[[Ordering[Transpose[{s[[All, 1]], Range[Length[s]]}]] ]]

Tally[Partition [sortedListbyUser,3,1]]
Dr. belisarius
  • 60,527
  • 15
  • 115
  • 190
1

This problem is similar to

Find k most frequent words from a file

Here is how you can solve it :

  • Group each triplet (page1, page2, page3) into a word
  • Apply the algorithm mentioned here
Community
  • 1
  • 1
craftsmannadeem
  • 2,665
  • 26
  • 22
1
1.Reads user page access urls from file line by line,these urls separated by separator,eg: 
u1,/
u1,main
u1,detail

The separator is comma.
2.Store each page's visit count to map:pageVisitCounts;
3.Sort the visit count map by value in descend order;

public static Map<String, Integer> findThreeMaxPagesPathV1(String file, String separator, int depth) {
    Map<String, Integer> pageVisitCounts = new HashMap<String, Integer>();
    if (file == null || "".equals(file)) {
        return pageVisitCounts;
    }
    try {
        File f = new File(file);
        FileReader fr = new FileReader(f);
        BufferedReader bf = new BufferedReader(fr);

        Map<String, List<String>> userUrls = new HashMap<String, List<String>>();
        String currentLine = "";
        while ((currentLine = bf.readLine()) != null) {
            String[] lineArr = currentLine.split(separator);
            if (lineArr == null || lineArr.length != (depth - 1)) {
                continue;
            }
            String user = lineArr[0];
            String page = lineArr[1];
            List<String> urlLinkedList = null;
            if (userUrls.get(user) == null) {
                urlLinkedList = new LinkedList<String>();
            } else {
                urlLinkedList = userUrls.get(user);
                String pages = "";
                if (urlLinkedList.size() == (depth - 1)) {
                    pages = urlLinkedList.get(0).trim() + separator + urlLinkedList.get(1).trim() + separator + page;
                } else if (urlLinkedList.size() > (depth - 1)) {
                    urlLinkedList.remove(0);
                    pages = urlLinkedList.get(0).trim() + separator + urlLinkedList.get(1).trim() + separator + page;
                }
                if (!"".equals(pages) && null != pages) {
                    Integer count = (pageVisitCounts.get(pages) == null ? 0 : pageVisitCounts.get(pages))  + 1;
                    pageVisitCounts.put(pages, count);
                }
            }
            urlLinkedList.add(page);
            System.out.println("user:" + user + ", urlLinkedList:" + urlLinkedList);
            userUrls.put(user, urlLinkedList);
        }
        bf.close();
        fr.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return pageVisitCounts;
}

public static void main(String[] args) {
    String file = "/home/ieee754/Desktop/test-access.log";
    String separator = ",";
    Map<String, Integer> pageVisitCounts = findThreeMaxPagesPathV1(file, separator, 3);
    System.out.println(pageVisitCounts.size());
    Map<String, Integer>  result = MapUtil.sortByValueDescendOrder(pageVisitCounts);
    System.out.println(result);
}
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – DimaSan Dec 15 '16 at 08:36