0

I need help with this python code. I am making an application that will read the mailbox data (mbox.txt) and count the number of email messages per organization (i.e. domain name of the email address) using a database with the following schema to maintain the counts. The top organizational count is 536.

This is the Schema: CREATE TABLE Counts (org TEXT, count INTEGER)

I've tried so many times I just can't get the count of 536. Here's my code below:

import sqlite3

conn = sqlite3.connect('emaildb.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Counts')

cur.execute('''
CREATE TABLE Counts (org TEXT, count INTEGER)''')

fname = input('Enter file name: ')
if (len(fname) < 1): fname = 'mbox.txt'

fh = open(name)
for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    org = pieces[1]
    cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))
    row = cur.fetchone()
    if row is None:
       cur.execute('''INSERT INTO Counts (org, count)
              VALUES (?, 1)''', (org,))
    else:
       cur.execute('UPDATE Counts SET count = count + 1 WHERE org = ?',
                    (org,))
conn.commit()

# https://www.sqlite.org/lang_select.html
sqlstr = 'SELECT org, count FROM Counts ORDER BY count DESC LIMIT 10'

for row in cur.execute(sqlstr):
    print(str(row[0]), row[1])

cur.close()

The highest number that I got is 195. Here is the output of the code above:

Enter the file name: 
zqian@umich.edu 195 
mmmay@indiana.edu 161
cwen@iupui.edu 158
chmaurer@iupui.edu 111
aaronz@vt.edu 110
ian@caret.cam.ac.uk 96
jimeng@umich.edu 93
rjlowe@iupui.edu 90
dlhaines@umich.edu 84
david.horwitz@uct.ac.za 67

Here's the link where I got the text file and wrote it to a text file called mbox.txt (https://www.py4e.com/code3/mbox.txt)

2 Answers2

0

You're not extracting the domain from the email. So multiple emails at the same domain are being treated as different organizations.

for line in fh:
    if not line.startswith('From: '): continue
    pieces = line.split()
    email = pieces[1]
    pieces = email.splot('@')
    org = pieces[1]
    ...

Also, you might want to use the code in SQLite INSERT - ON DUPLICATE KEY UPDATE (UPSERT) so you don't have to do a SELECT query to see if the organization already exists.

Barmar
  • 741,623
  • 53
  • 500
  • 612
0

Your retrieved results are email addresses, not email domains. You have to split the email addresses at the '@' symbol to get domain names:

if not line.startswith('From: '): 
    continue
pieces = line.split('@') # this is what you want
org = pieces[1]
cur.execute('SELECT count FROM Counts WHERE org = ? ', (org,))

Explanation: instead of splitting the string at every space, which is the default behaviour of the Python str.split() function, we split the string at the '@' sign. So an line in your text file like 'From: name@email.com' would become a list with two parts: ['From: name', 'email.com']

Then you can use the second part and keep track of that part instead, and hopefully the code will work.

Leo Qi
  • 557
  • 5
  • 13