
I'd like to ask for help with finishing my Python code.

I have a huge text file, filled with 3 columns:

  • The first has user names, for example: user_003
  • The second has visit IDs, for example: visit_456
  • The third has Unix timestamps of these visits.

Example:

(...)
user_123    visit_188   1330796847
user_123    visit_188   1330797173
user_123    visit_189   1330802227
user_123    visit_189   1330802277
user_123    visit_190   1330806287
user_123    visit_190   1330806353
(...)

I've written a small portion of a script that counts the frequencies of ALL words in my text file: user names, visit IDs and timestamps.

I can easily print out the first few most frequently appearing words (for the moment I've passed the value 10 to most_common).

All I need to do now is to filter the results of my script, so that instead of the whole list of word counts I can show only:

  1. the name and the count of the most common visit
  2. the name of the user that appears most often in my text file

I've tried several things, but sadly nothing comes to mind at the moment. I'll gladly accept any help. Thanks in advance.

My code:

import re
from collections import Counter

with open("bigfile.txt", "r") as f:
    data = f.read()

words = re.findall(r'\w+', data)

word_counts = Counter(words).most_common(10)

print(word_counts)

output:

[('user_819', 27), ('user_356', 25), ('visit_637', 25), ('user_520', 24), ('user_1222', 24), ('user_191', 22), ('user_473', 22), ('user_542', 22), ('user_812', 22), ('visit_1383', 22)]
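
For reference, one of the things I tried was to split this Counter by prefix, so users and visits are counted separately (this builds on the words list from the code above); I'm not sure it's the right approach:

counts = Counter(words)

# keep only the user_* and visit_* words, ignoring the timestamps
user_counts = Counter({w: c for w, c in counts.items() if w.startswith('user_')})
visit_counts = Counter({w: c for w, c in counts.items() if w.startswith('visit_')})

print(user_counts.most_common(1))   # most frequent user and its count
print(visit_counts.most_common(1))  # most frequent visit and its count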
3 Answers


Since you have a "huge text file", a faster method would be to use Python Pandas, which avoids explicit Python for loops (these are slow).

Code

import pandas as pd

df = pd.read_csv("bigfile.txt", header=None, sep=r'\s+')  # Read whitespace-separated file into a DataFrame
df.columns = ['users', 'visits', 'dates']                 # Name columns

# Most frequent user
n = 1                                                     # top n, i.e. could be 1, 2, 3, etc.
print(df['users'].value_counts()[:n])

# Most frequent visit
print(df['visits'].value_counts()[:n])

Example

File: bigfile.txt

user_123    visit_188   1330796847
user_123    visit_188   1330797173
user_123    visit_189   1330802227
user_123    visit_189   1330802277
user_123    visit_190   1330806287
user_123    visit_190   1330806353
user_123    visit_190   1330806353
user_456    visit_191   1330806354

Result for df['users'].value_counts()[:n] shows user_123 occurred 7 times

user_123    7
Name: users, dtype: int64

Result for df['visits'].value_counts()[:n] shows visit_190 occurred 3 times

visit_190    3
Name: visits, dtype: int64
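
If only the single top name and its count are needed (rather than printing a Series), they can be pulled out of the same value_counts result, for example (the variable names here are just illustrative):

top_user = df['users'].value_counts().idxmax()     # name of the most frequent user
top_user_count = df['users'].value_counts().max()  # how many times it appears
print(top_user, top_user_count)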
DarrylG
  • Your calculations are correct. I've double-checked against my original text file. I didn't know the pandas module, so I was trying to do it 'the hard and primitive' way. Thanks to you I'll learn the module. Thank you! – Tomasz Wąsowicz May 01 '21 at 19:22
  • I've tried to upvote your answer, but I don't have the reputation required to do so. I've gladly accepted your answer. – Tomasz Wąsowicz May 01 '21 at 19:40
  • @TomaszWąsowicz--no worries. – DarrylG May 01 '21 at 20:34

This is also possible without libraries. The following just prints the most frequent (user, visit) pair.

data = """user_123    visit_188   1330796847
user_123    visit_188   1330797173
user_123    visit_188   1330797173
user_123    visit_188   1330797173
user_123    visit_189   1330802227
user_123    visit_189   1330802277
user_123    visit_190   1330806287
user_123    visit_190   1330806353
"""

c = {}
for line in data.split('\n'):
    if not line.strip():               # skip blank lines (e.g. the trailing newline)
        continue
    idx = tuple(line.split()[:2])      # (user, visit) pair as the dictionary key
    if idx in c:
        c[idx] += 1
    else:
        c[idx] = 1
ordered = sorted(c.items(), key=lambda x: x[1], reverse=True)  # sort by count, descending
print(ordered[0])
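
To run the same idea over the actual file instead of the inline string, the loop can read the file line by line; a rough sketch, assuming the bigfile.txt name from the question:

c = {}
with open("bigfile.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 2:             # skip blank or malformed lines
            continue
        idx = (parts[0], parts[1])     # (user, visit) pair
        c[idx] = c.get(idx, 0) + 1

ordered = sorted(c.items(), key=lambda x: x[1], reverse=True)
print(ordered[0])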
jwal
  • Your solution also works, thank you. I've tried to upvote, but I still need 2 more reputation points before my upvotes become visible. – Tomasz Wąsowicz May 01 '21 at 19:46
  • Cheers, as always there are so many possible approaches to solving an issue. Python is a little more opinionated than other languages, which is actually nice. – jwal May 02 '21 at 06:45

You need to parse out the user names and visit IDs separately and maintain two separate counters:

import re
from collections import Counter

with open("bigfile.txt", "r") as f:
    data = f.read()
    
visit_counter = Counter()
user_counter = Counter()
rex = re.compile(r'^(\w+)\s+(visit_\d+)')    # capture the user name and the visit id
for line in data.split('\n'):
    m = rex.search(line)
    if m:                                    # skip blank or malformed lines
        user = m[1]
        visit = m[2]
        user_counter[user] += 1
        visit_counter[visit] += 1
most_common_visit, most_common_visit_count = visit_counter.most_common(1)[0]
print('most common visit:', most_common_visit, 'number:', most_common_visit_count)
print('most common user:', user_counter.most_common(1)[0][0])
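
Since the file is huge, a possible variation is to fill the same counters without reading the whole file into memory, by iterating over the file object directly:

import re
from collections import Counter

visit_counter = Counter()
user_counter = Counter()
rex = re.compile(r'^(\w+)\s+(visit_\d+)')

with open("bigfile.txt", "r") as f:
    for line in f:                       # one line at a time, no full read()
        m = rex.search(line)
        if m:
            user_counter[m[1]] += 1
            visit_counter[m[2]] += 1

print('most common visit:', visit_counter.most_common(1)[0])
print('most common user:', user_counter.most_common(1)[0][0])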
Booboo
  • Wow, please give me a while to analyze the code – Tomasz Wąsowicz May 01 '21 at 19:12
  • Hmm, OK, I think the calculations are not correct (the ones made above by DarrylG are), but you did it the way I wanted to. I think the code needs a few small corrections to show the proper results. Thank you for the huge effort. – Tomasz Wąsowicz May 01 '21 at 19:20
  • It's OK, I am sorry your answer was downvoted. After the corrections you made, your code does exactly what I was trying to do. Thank you again. Now, after reaching 15 points, I was able to upvote you, thanks again. – Tomasz Wąsowicz May 01 '21 at 19:45
  • No need to apologize. I am the one who needs to apologize for making false assumptions. – Booboo May 01 '21 at 20:33