0

Given a log file with data like:

USER FROM_PAGE TO_PAGE
A url1 url2
A url1 url3
B url1 url3
A url2 url3
...
...
url can be string like www.google.com/activity/xyz

Return the possiblity of any user moving from one page to another page

I thought of using a dictionary but couldn't come up with a solution

Expected ouput should be like:

user A:
url1 ---> url2: 50%
url1 ---> url3: 50%
url2 ---> url3 : 100%

user B:
url1 ---> url2 : 100%
eyllanesc
  • 235,170
  • 19
  • 170
  • 241
  • Dictionary of dictionaries? Dictionary of lists of pairs? The answer depends on how do you plan to use this structure. – Dmitry Kuzminov May 20 '19 at 22:28
  • Are you just looking for a way to store this information or a way to calculate the percentages? – Mark May 20 '19 at 22:30
  • A data structure is only part of a solution; you need a full algorithm. What data manipulations do you plan to use for solving this? Post your algorithm with a clear description, and we can suggest a data structure to support your flow. If you don't have an algorithm either, then you're not yet ready to post to Stack Overflow. – Prune May 20 '19 at 22:31
  • 1
    What you're describing is a transition matrix. Let me know if the linked answers don't help, glad to reopen. – Brad Solomon May 20 '19 at 22:33
  • 1
    @BradSolomon the linked answer doesn't address the subdivision by users, also the question is about a predefined size of the transition matrix, while here you know the size of the transition matrix (possibly a very large one) only at the end of the input phase, Is it possible that this is better solved used a sparse structure using a `defaultdict` keyed on users to hold `Counter`s that count the different transitions. – gboffi May 20 '19 at 22:48
  • I think @gboffi is right - `defaultdict` and `Counter` produce a ~20 line solution I couldn't get around to posting before it was marked duplicate. You don't need heavy dependencies like `pandas` for this! – Ben May 20 '19 at 22:50
  • Good point, sorry to jump the gun - reopened. – Brad Solomon May 20 '19 at 23:01

2 Answers2

2

You can use the collections module to make this very clean. This solution uses a defaultdict to auto-create a new Counter whenever a new user is seen, then adds one to that counter for every redirect.

At the end of the "read from file" loop, we then have a data structure that looks like: {user : {(url1, url2): count}}. This organization makes everything pretty easy to print in the second loop.

from collections import Counter, defaultdict

users_to_stats = defaultdict(Counter)

with open('tmp.txt') as fp:
    for line in fp:
        user, url1, url2 = line.split()

        users_to_stats[user][(url1, url2)] += 1

for user, counts in users_to_stats.items():
    print(user)

    total_redirects_per_user = sum(counts.values())
    for ((url1, url2), count) in counts.items():
        print(f'{url1} -> {url2} : {count / total_redirects_per_user}')

Prints:

A
url1 -> url2 : 0.5
url1 -> url3 : 0.25
url2 -> url3 : 0.25
B
url1 -> url3 : 1.0
Ben
  • 5,952
  • 4
  • 33
  • 44
1

You can use a tuple as keys to a dictionary

For example

possibility = {}
possibility[(A, url1, url2)] = 0.5
possibility[(B, url1, url2)] = 1
Anthony Kong
  • 37,791
  • 46
  • 172
  • 304