Is there a data structure that I can use to solve this?

Question

Given a log file with data like:

USER FROM_PAGE TO_PAGE
A url1 url2
A url1 url3
B url1 url3
A url2 url3
...
...
url can be string like www.google.com/activity/xyz

Return the possiblity of any user moving from one page to another page

I thought of using a dictionary but couldn't come up with a solution

Expected ouput should be like:

user A:
url1 ---> url2: 50%
url1 ---> url3: 50%
url2 ---> url3 : 100%

user B:
url1 ---> url2 : 100%

Dictionary of dictionaries? Dictionary of lists of pairs? The answer depends on how do you plan to use this structure. — Dmitry Kuzminov, May 20 '19 at 22:28
Are you just looking for a way to store this information or a way to calculate the percentages? — Mark, May 20 '19 at 22:30
A data structure is only part of a solution; you need a full algorithm. What data manipulations do you plan to use for solving this? Post your algorithm with a clear description, and we can suggest a data structure to support your flow. If you don't have an algorithm either, then you're not yet ready to post to Stack Overflow. — Prune, May 20 '19 at 22:31
What you're describing is a transition matrix. Let me know if the linked answers don't help, glad to reopen. — Brad Solomon, May 20 '19 at 22:33
@BradSolomon the linked answer doesn't address the subdivision by users, also the question is about a predefined size of the transition matrix, while here you know the size of the transition matrix (possibly a very large one) only at the end of the input phase, Is it possible that this is better solved used a sparse structure using a `defaultdict` keyed on users to hold `Counter`s that count the different transitions. — gboffi, May 20 '19 at 22:48
I think @gboffi is right - `defaultdict` and `Counter` produce a ~20 line solution I couldn't get around to posting before it was marked duplicate. You don't need heavy dependencies like `pandas` for this! — Ben, May 20 '19 at 22:50

Ben · Answer 1 · 2019-05-20T23:13:12.950

You can use the collections module to make this very clean. This solution uses a defaultdict to auto-create a new Counter whenever a new user is seen, then adds one to that counter for every redirect.

At the end of the "read from file" loop, we then have a data structure that looks like: {user : {(url1, url2): count}}. This organization makes everything pretty easy to print in the second loop.

from collections import Counter, defaultdict

users_to_stats = defaultdict(Counter)

with open('tmp.txt') as fp:
    for line in fp:
        user, url1, url2 = line.split()

        users_to_stats[user][(url1, url2)] += 1

for user, counts in users_to_stats.items():
    print(user)

    total_redirects_per_user = sum(counts.values())
    for ((url1, url2), count) in counts.items():
        print(f'{url1} -> {url2} : {count / total_redirects_per_user}')

Prints:

A
url1 -> url2 : 0.5
url1 -> url3 : 0.25
url2 -> url3 : 0.25
B
url1 -> url3 : 1.0

This is the answer I was also writing when the question was temporarily closed... but neater. You have my vote. — gboffi, May 21 '19 at 07:38

score 1 · Answer 2 · answered May 20 '19 at 22:27

1

You can use a tuple as keys to a dictionary

For example

possibility = {}
possibility[(A, url1, url2)] = 0.5
possibility[(B, url1, url2)] = 1

answered May 20 '19 at 22:27

Anthony Kong

37,791
46
172
304

Is there a data structure that I can use to solve this?

2 Answers2