My question is similar to this one, but slightly different. I am reading through a file, looking for lines that start with 'From' and contain an email address, storing these addresses in a dictionary, and printing the address that occurs most often.

The line to be looked for in the files is this :

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Whenever such a line is found, the email address should be extracted and appended to a list, from which the dictionary is then built.

I came upon this code sample for printing the maximum key,value in a dict:

counts = dict()  
names = ['csev','owen','csev','zqian','cwen']  
for name in names:  
  counts[name] = counts.get(name,0) + 1  
  maximum = max(counts, key = counts.get)
print maximum, counts[maximum]

From this sample code I then tried with this program:

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)
    # loop through the list to access each line individually
    for email in matches :
        # place values in variable
        out = email
        # looking through each line for any email add found
        found = re.findall(r'[\w\.-]+@[\w\.-]+', out)
        # loop through the found emails and print them out
        for i in found :
            i.split()
            addy.append(i)
            for i in addy:
                counts[i] = counts.get(i, 0) + 1
                maximum = max(counts, key=counts.get)
    print counts
    print maximum, counts[maximum]

Now the issue is that there are only 27 lines starting with 'From ', and the most frequent email address in that list should be 'cwen@iupui.edu', which occurs 5 times, but when I run the code my output is this:

{'gopal.ramasammycook@gmail.com': 1640, 'louis@media.berkeley.edu': 7207, 'cwen@iupui.edu': 8888, 'antranig@caret.cam.ac.uk': 1911, 'rjlowe@iupui.edu': 10678, 'gsilver@umich.edu': 10140, 'david.horwitz@uct.ac.za': 4205, 'wagnermr@iupui.edu': 2500, 'zqian@umich.edu': 16804, 'stephen.marquard@uct.ac.za': 7490, 'ray@media.berkeley.edu': 168}

Here's the link to the text file for better understanding : text file

moodygeek
  • While I don't have an answer, this might interest you: https://pymotw.com/2/collections/counter.html – AncientSwordRage Aug 12 '16 at 09:28
  • You could also use [https://docs.python.org/2/library/collections.html#collections.defaultdict](https://docs.python.org/2/library/collections.html#collections.defaultdict), i.e., `counts = defaultdict( int )` instead of your dictionary `counts`; then you could write something like `counts[email] += 1`, where email is an email address as a string. – desiato Aug 12 '16 at 09:44
  • @desiato Thanks! I've seen a `defaultdict` before but wanted to keep it simple here. – Noelkd Aug 15 '16 at 10:45

4 Answers

1

You have a couple of issues.

The first is that the for email in matches loop runs again for every line in the text file, because it sits inside the for lines in handle loop. Move it out of that loop:

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    matches.append(lines)

# loop through the list to access each line individually
for email in matches:

With that change you now iterate over matches only once.
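
To see why the nested version inflates the counts, here is a minimal sketch on a made-up three-entry sample. The addresses and variable names are hypothetical, and it uses `print()` as a function, so it assumes Python 3 rather than the Python 2 used in the question:

```python
# Hypothetical sample standing in for the addresses found on "From " lines.
sample = ["a@x.org", "b@x.org", "a@x.org"]

# Nested: everything collected so far is re-counted for each new entry.
seen = []
nested_counts = {}
for address in sample:
    seen.append(address)
    for item in seen:
        nested_counts[item] = nested_counts.get(item, 0) + 1

# Flat: each entry is counted exactly once.
flat_counts = {}
for address in sample:
    flat_counts[address] = flat_counts.get(address, 0) + 1

print(nested_counts)
print(flat_counts)
```

The nested loop adds the whole accumulated list again on every pass, which is how 27 lines can turn into counts in the thousands.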

Since we know there is only one address on each matched line, we can take the first result of findall:

found = re.findall(r'[\w\.-]+@[\w\.-]+', out)[0]

To count how many times each address has been seen, I changed:

# loop through the found emails and print them out
for i in found :
    i.split()
    addy.append(i)
    for i in addy:
        counts[i] = counts.get(i, 0) + 1
        maximum = max(counts, key=counts.get)

to the more readable:

if found in counts:
    counts[found] += 1
else:
    counts[found] = 1

Then you can compute the maximum once at the end, rather than updating it on every iteration:

print counts
print max(counts, key=counts.get)
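
As a small illustration of how the key function affects the result, here is a sketch with a made-up dictionary (the addresses are hypothetical, and it assumes Python 3 for `print()`). `max` iterates over the dictionary's keys, so the key function has to map each address to its count:

```python
# Hypothetical counts, for illustration only.
counts = {"zed@example.org": 1, "amy@example.org": 5, "bob@example.org": 3}

# counts.get maps each key to its count, so max() compares the counts.
most_common = max(counts, key=counts.get)
print(most_common, counts[most_common])

# Indexing the key compares a single character of the address instead,
# which has nothing to do with how often the address occurs.
by_second_char = max(counts, key=lambda address: address[1])
print(by_second_char)
```

A key such as `lambda x: x[1]` only gives the right answer when the most frequent address also happens to have the largest second character.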

Putting it together you get:

import re

name = raw_input("Enter file:")
if len(name) < 1 : 
    name = "mbox-short.txt"
handle = open(name)
matches = []
addy = []
counts = dict()

for lines in handle :
    # skip every line that does not start with "From "
    if not lines.startswith("From ") : continue
    # keep the matching line for the second pass
    matches.append(lines)

# loop through the collected lines once, after the file has been read
for email in matches:
    # take the first (and only) address found on the line
    found = re.findall(r'[\w\.-]+@[\w\.-]+', email)[0]
    # count this address
    if found in counts:
        counts[found] += 1
    else:
        counts[found] = 1

print counts
# the key function looks up the count of each address
print max(counts, key=counts.get)

Which returns:

{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3, 'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1, 'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3, 'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1, 'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2, 'ray@media.berkeley.edu': 1}
cwen@iupui.edu
Noelkd
  • Yes, that worked well. I even came up with a solution of my own after taking a step back and working through each piece of the initial code, when I realised that the found variable holds the emails as a list. – moodygeek Aug 12 '16 at 11:04
1
  1. lines.split() does not change lines, and the same goes for i.split(): it returns a new list, which your code discards. Use print to check these temporary values.
  2. Check whether the for loops do what you want.
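
The first point can be checked with a short sketch. It reuses the sample line from the question; the variable names are only illustrative, and it assumes Python 3 for `print()`:

```python
line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n"

# Strings are immutable: split() returns a new list and leaves line as it was.
line.split()
print(repr(line))

# To use the result, bind it to a name.
words = line.split()
address = words[1]
print(address)
```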

    import re
    import collections
    
    addy = []
    
    with open("mbox-short.txt") as handle:
        for lines in handle :
            if not lines.startswith("From ") : continue
            found = re.search(r'[\w\.-]+@[\w\.-]+', lines).group()
            addy.append(found.split('@')[0])
    print collections.Counter(addy).most_common(1)
    # out: [('cwen', 5)]
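
If the full address is wanted rather than only the part before the @, a variant might look like the sketch below. The sample lines are made up for illustration and stand in for the file, and it assumes Python 3 for `print()`:

```python
import collections
import re

# Hypothetical lines in the style of the mbox file; not taken from it.
sample_lines = [
    "From cwen@iupui.edu Thu Jan  3 16:23:48 2008",
    "From: cwen@iupui.edu",
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008",
    "From cwen@iupui.edu Fri Jan  4 11:37:30 2008",
]

pattern = re.compile(r'[\w.-]+@[\w.-]+')
addresses = []
for line in sample_lines:
    # "From: ..." header lines are skipped; only "From " lines are counted.
    if not line.startswith("From "):
        continue
    match = pattern.search(line)
    if match:  # guard against a "From " line without an address
        addresses.append(match.group())

top = collections.Counter(addresses).most_common(1)
print(top)
```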
    
Ohad Eytan
citaret
  • This worked, thanks, but the course I'm following hasn't covered collections yet at the stage I'm at. I'll read through the docs on collections as a result of this, though :) – moodygeek Aug 12 '16 at 14:12
0

Your loops over matches and over found do not have the right indentation. First you iterate over all lines in the file and add every line that starts with "From " to matches. Afterwards you have to iterate over those matches. Similarly, for each matched line you add all email addresses to addy. Afterwards you have to iterate over this list. I.e.,

for lines in handle :
    # look for specific characters in document text
    ...

for email in matches :
    ...

    for i in found :
        i.split()
        addy.append(i)

for i in addy:
    counts[i] = counts.get(i, 0) + 1

maximum = max(counts, key=counts.get)
desiato
0

After further reflection on my code, I came up with an answer quite similar to that of @Noelkd:

import re

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)

email_matches = []
found_emails = []
final_emails = []
counts = dict()

for lines in handle :
    # look for specific characters in document text
    if not lines.startswith("From ") : continue
    # increment the count variable for each match found
    lines.split()
    # append the required lines to the matches list
    email_matches.append(lines)

for email in email_matches :
    out = email
    found = re.findall(r'[\w\.-]+@[\w\.-]+',  out)
    found_emails.append(found)

for item in found_emails :
    count = item[0]
    final_emails.append(count)

for items in final_emails:
    counts[items] = counts.get(items,0) + 1
    maximum = max(counts, key = lambda x: counts.get(x))
print maximum, counts[maximum]

And the output

cwen@iupui.edu 5
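
For comparison, the same result can be reached in a single pass with only a plain dictionary. This is a sketch under a few assumptions: the function name `most_frequent_sender` and the sample lines are hypothetical, it takes the second whitespace-separated field instead of using a regex, and it assumes Python 3 for `print()`:

```python
def most_frequent_sender(lines):
    """Return (address, count) for the most frequent "From " sender."""
    counts = {}
    for line in lines:
        if not line.startswith("From "):
            continue
        words = line.split()
        if len(words) < 2:  # a bare "From " line carries no address
            continue
        counts[words[1]] = counts.get(words[1], 0) + 1
    if not counts:
        return None, 0
    # Compute the maximum once, after all lines have been counted.
    top = max(counts, key=counts.get)
    return top, counts[top]


# Hypothetical lines in the style of the mbox file; not taken from it.
sample = [
    "From cwen@iupui.edu Thu Jan  3 16:23:48 2008",
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008",
    "From cwen@iupui.edu Fri Jan  4 11:37:30 2008",
]
print(most_frequent_sender(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample list.
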
moodygeek