find most occurring word in text file using dictionary

Question

My question is similar to this one but slightly different. I am trying to read through a file, look for lines that start with 'From' and contain an email address, store those addresses in a dictionary, and print the address that occurs most often.

The lines to look for in the file look like this:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Whenever such a line is found, the email address should be extracted and added to a list, from which the dictionary is then built.

I came upon this code sample for printing the maximum key,value in a dict:

counts = dict()  
names = ['csev','owen','csev','zqian','cwen']  
for name in names:  
  counts[name] = counts.get(name,0) + 1  
  maximum = max(counts, key = counts.get)
print maximum, counts[maximum]

From this sample code I then tried with this program:

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each math found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
    # loop through the list to acess each line individually
    for email in matches :
        # place values in variable
        out = email
        # looking through each line for any email add found
        found = re.findall(r'[\w\.-]+@[\w\.-]+', out)
        # loop through the found emails and print them out
        for i in found :
            i.split()
            addy.append(i)
            for i in addy:
                counts[i] = counts.get(i, 0) + 1
                maximum = max(counts, key=counts.get)
    print counts
    print maximum, counts[maximum]

Now the issue is that there are only 27 lines starting with 'From', and the most frequent email in that list should be 'cwen@iupui.edu', which occurs 5 times. But when I run the code, my output is this:

{'gopal.ramasammycook@gmail.com': 1640, 'louis@media.berkeley.edu': 7207, 'cwen@iupui.edu': 8888, 'antranig@caret.cam.ac.uk': 1911, 'rjlowe@iupui.edu': 10678, 'gsilver@umich.edu': 10140, 'david.horwitz@uct.ac.za': 4205, 'wagnermr@iupui.edu': 2500, 'zqian@umich.edu': 16804, 'stephen.marquard@uct.ac.za': 7490, 'ray@media.berkeley.edu': 168}

Here's the link to the text file for better understanding: text file

While I don't have an answer, this might interest you: https://pymotw.com/2/collections/counter.html – AncientSwordRage, Aug 12 '16 at 09:28
You could also use [https://docs.python.org/2/library/collections.html#collections.defaultdict](https://docs.python.org/2/library/collections.html#collections.defaultdict), i.e., `counts = defaultdict(int)` instead of your dictionary `counts`. Then you could write something like `counts[email] += 1`, where `email` is an email address as a string. – desiato, Aug 12 '16 at 09:44
@desiato Thanks! I've seen a `defaultdict` before but wanted to keep it simple here. – Noelkd, Aug 15 '16 at 10:45
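
For reference, the `defaultdict` suggestion in the comments could look roughly like the sketch below. The `emails` list is a hypothetical stand-in for the addresses pulled from the file, and the code is written so that it should run under Python 3 as well as the Python 2 used in this thread.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the extracted addresses.
emails = ["cwen@iupui.edu", "zqian@umich.edu", "cwen@iupui.edu"]

# defaultdict(int) supplies 0 for a missing key, so neither .get()
# nor an "if key in counts" test is needed before incrementing.
counts = defaultdict(int)
for email in emails:
    counts[email] += 1

# Pick the key whose stored count is largest.
most_common = max(counts, key=counts.get)
print("%s %d" % (most_common, counts[most_common]))
```

The trade-off is one extra import in exchange for a shorter counting loop; the plain `dict` with `.get(key, 0)` behaves the same way here.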

score 1 · Accepted Answer · answered Aug 12 '16 at 09:57

You have a couple of issues.

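The first one is that the for email in matches loop is nested inside the loop over the file, so it runs again for every line that is read. Move it out of that loop so it runs only once, after all the matching lines have been collected: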

for lines in handle :
    # skip every line that does not start with "From "
    if not lines.startswith("From ") : continue
    # split() returns a new list that is discarded here, so this line has no effect
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)

# loop through the list to access each line individually
for email in matches:

So with that change you are now iterating over the matches only once.
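
To see why the nested version inflates the counts, here is a minimal sketch. It is not the asker's exact program: it assumes a hypothetical three-item list `found_per_line` in place of the file, and compares recounting the growing list after every append with counting each address once.

```python
# Hypothetical stand-in for three matched addresses, in file order.
found_per_line = ["a@x.org", "b@x.org", "a@x.org"]

# Nested version: the whole accumulated list is recounted after every append,
# so earlier addresses are counted again and again.
addy = []
nested_counts = {}
for email in found_per_line:
    addy.append(email)
    for seen in addy:
        nested_counts[seen] = nested_counts.get(seen, 0) + 1

# Flat version: each address is counted exactly once.
flat_counts = {}
for email in found_per_line:
    flat_counts[email] = flat_counts.get(email, 0) + 1

print(nested_counts)
print(flat_counts)
```

With only three items the nested version already reports 4 and 2 instead of 2 and 1. The original program has one more level of nesting, which is why its totals run into the thousands.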

Then, since we know that each matched line contains only one address, we can take the first element of the list that findall returns:

found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]
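
As a small illustration of what that indexing does, the sketch below applies the same pattern to the sample line from the question. The guard for an empty result and the `no_match` example are my additions, assuming you would rather get `None` than an `IndexError` on a line without an address.

```python
import re

pattern = r'[\w\.-]+@[\w\.-]+'
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

# findall always returns a list, even when there is only one match.
matches = re.findall(pattern, line)
print(matches)

# Indexing with [0] turns the one-element list into a plain string.
# The guard avoids an IndexError when nothing matched.
first = matches[0] if matches else None
print(first)

# A line without an address yields an empty list.
no_match = re.findall(pattern, "From nobody in particular")
print(no_match)
```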

To count how many times we've seen each address, I've changed:

# loop through the found emails and print them out
for i in found :
    i.split()
    addy.append(i)
    for i in addy:
        counts[i] = counts.get(i, 0) + 1
        maximum = max(counts, key=counts.get)

to the more readable:

if found in counts:
    counts[found] += 1
else:
    counts[found] = 1

Then you can compute the maximum once at the end, rather than recomputing it on every iteration:

print counts
print max(counts, key=counts.get)
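
One thing to watch with `max` on a dictionary: it iterates over the keys only, so the key function receives the address string rather than a (key, value) pair. A key such as `lambda x: x[1]` therefore compares the second character of each address, and gives the right answer only by coincidence. The sketch below uses a hypothetical two-entry dictionary chosen so that the two approaches disagree.

```python
# Hypothetical counts chosen so that the two key functions disagree.
counts = {"zoe@example.org": 1, "amy@example.org": 3}

# Correct: look up each key's count.
by_count = max(counts, key=counts.get)

# Misleading: x is the address string, so x[1] is its second character
# ('o' for zoe, 'm' for amy), not the count.
by_second_char = max(counts, key=lambda x: x[1])

# Indexing with [1] is appropriate when iterating over (key, value) pairs.
top_pair = max(counts.items(), key=lambda kv: kv[1])

print(by_count)
print(by_second_char)
print(top_pair)
```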

Putting it together you get:

import re

name = raw_input("Enter file:")
if len(name) < 1:
    name = "mbox-short.txt"
handle = open(name)
matches = []
counts = dict()

for lines in handle:
    # skip every line that does not start with "From "
    if not lines.startswith("From "):
        continue
    # keep the matching line for the second pass
    matches.append(lines)

# loop through the collected lines once, after the file has been read
for email in matches:
    # extract the first email address found on the line
    found = re.findall(r'[\w\.-]+@[\w\.-]+', email)[0]
    # count how many times each address has been seen
    if found in counts:
        counts[found] += 1
    else:
        counts[found] = 1

print counts
# the key function looks up each address's count
print max(counts, key=counts.get)

Which returns:

{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3, 'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1, 'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3, 'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1, 'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2, 'ray@media.berkeley.edu': 1}
cwen@iupui.edu

Yes, that worked well. I even came up with a solution of my own after taking a step back and working through each piece of the initial code, when I realised that the found variable holds each email as a list. – moodygeek, Aug 12 '16 at 11:04

score 1 · Answer 2 · edited Aug 12 '16 at 10:14

lines.split() does not change lines, and i.split() does not change i: split returns a new list, which your code throws away. Use print to verify these temporary values.
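
A quick way to convince yourself of this is the sketch below, which uses the sample line from the question (with a trailing newline added, as it would have when read from a file). The names `line`, `words` and `email` are only for illustration.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n"

# Calling split() without using the result changes nothing:
# strings are immutable, so line is still the same string afterwards.
line.split()
print(repr(line))

# The list only exists if the return value is kept.
words = line.split()
print(words)

# On a "From " line the address is the second whitespace-separated field.
email = words[1]
print(email)
```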

Also check whether the for loops do what you want.

import re
import collections

addy = []

with open("mbox-short.txt") as handle:
    for lines in handle :
        if not lines.startswith("From ") : continue
        found = re.search(r'[\w\.-]+@[\w\.-]+', lines).group()
        addy.append(found.split('@')[0])
print collections.Counter(addy).most_common(1)
# out: [('cwen', 5)]
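
If the full address is wanted rather than only the part before the @, the same `Counter` approach can count the whole match. This is a sketch under stated assumptions: `sample_lines` is a hypothetical in-memory stand-in for the file, and the `if match` guard is my addition for lines that start with "From " but contain no address.

```python
import re
from collections import Counter

# Hypothetical lines standing in for the contents of the mbox file.
sample_lines = [
    "From cwen@iupui.edu Fri Jan 4 11:37:30 2008",
    "From: cwen@iupui.edu",
    "From zqian@umich.edu Fri Jan 4 16:10:39 2008",
    "From cwen@iupui.edu Fri Jan 4 11:35:08 2008",
]

pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
addresses = []
for line in sample_lines:
    # "From: ..." header lines are skipped; only "From " lines count.
    if not line.startswith("From "):
        continue
    match = pattern.search(line)
    if match:
        # Keep the whole address, not just the local part.
        addresses.append(match.group())

# most_common(1) returns a one-element list of (address, count).
top = Counter(addresses).most_common(1)
print(top)
```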

edited Aug 12 '16 at 10:14 by Ohad Eytan · answered Aug 12 '16 at 10:03 by citaret

This worked, thanks, but at the stage of the course I'm at, my Python learning hasn't covered collections yet. I'll read through the docs on collections as a result of this, though :) – moodygeek Aug 12 '16 at 14:12

score 0 · Answer 3 · answered Aug 12 '16 at 09:34

Your loops over matches and over found do not have the right indentation. First, iterate over all lines in the file and add every line that starts with "From " to matches. Afterwards, iterate over those matches. Similarly, for each matched line, add all email addresses to addy, and afterwards iterate over that list. I.e.,

for lines in handle :
    # look for specific characters in document text
    ...

for email in matches :
    ...

    for i in found :
        i.split()
        addy.append(i)

    for i in addy:
        counts[i] = counts.get(i, 0) + 1
        maximum = max(counts, key=counts.get)

answered Aug 12 '16 at 09:34 by desiato

Yeah, I tried that and it still didn't work. I was still getting the same output. – moodygeek Aug 12 '16 at 09:45
Do you get exactly the same output as before? Then please post a short example of the file content and the desired output. – desiato Aug 12 '16 at 09:49
Yep, the same output. I put a link to the file itself in my question. The desired output should be `cwen@iupui.edu 5` – moodygeek Aug 12 '16 at 11:11

score 0 · Answer 4 · edited May 23 '17 at 12:22

I came up with an answer very similar to that of @Noelkd after further reflection on my code:

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)

email_matches = []
found_emails = []
final_emails = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # note: split() returns a new list that is discarded, so this line has no effect
    lines.split()
    # append the required lines to the matches list
    email_matches.append(lines)

for email in email_matches :
    out = email
    found = re.findall(r'[\w\.-]+@[\w\.-]+',  out)
    found_emails.append(found)

for item in found_emails :
    count = item[0]
    final_emails.append(count)

for items in final_emails:
    counts[items] = counts.get(items,0) + 1
    maximum = max(counts, key = lambda x: counts.get(x))
print maximum, counts[maximum]

And the output:

cwen@iupui.edu 5
