Counting and then removing duplicates from email list

Question

I have a long list of email addresses (8000) sorted alphabetically but there are duplicates.

With Python, how can I count how many times each unique email occurs (count duplicates) and, while keeping one instance of each email, delete the recurring duplicates from the list?

example list:

a@sample.com
b@sample.com
b@sample.com
b@sample.com
c@sample.com
c@sample.com

results:

a@sample.com (1)
b@sample.com (3)
c@sample.com (2)

I've searched online, but only found methods for removing duplicate numbers, dictionaries and tuples.

Vishnu Upadhyay · Answer 1 · 2014-12-02T13:40:15.973

Use itertools.groupby() to get the counts in alphabetically sorted order:

 >>> from itertools import groupby
 >>> # l is the list of emails
 >>> [(key, sum(1 for _ in group)) for key, group in groupby(sorted(l))]

[('a@sample.com', 1), ('b@sample.com', 3), ('c@sample.com', 2)]
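The question also asks for the list with duplicates removed. A minimal sketch of how the same groupby pass could give both results; the names `emails`, `counts` and `unique_emails` are illustrative, and the data is the sample from the question:

```python
from itertools import groupby

emails = [
    "a@sample.com",
    "b@sample.com", "b@sample.com", "b@sample.com",
    "c@sample.com", "c@sample.com",
]

# groupby only merges adjacent equal items, so sort first
counts = [(key, sum(1 for _ in group)) for key, group in groupby(sorted(emails))]

# one instance of each address, still in alphabetical order
unique_emails = [key for key, _ in counts]

for email, n in counts:
    print("{} ({})".format(email, n))
```

Since the list in the question is already sorted, the `sorted()` call is only a safeguard.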

Use collections.Counter to count how many times each item occurs:

>>> from collections import Counter
>>> d = Counter(['a@sample.com',
...              'b@sample.com',
...              'b@sample.com',
...              'b@sample.com',
...              'c@sample.com',
...              'c@sample.com'])
>>> d

Output:

Counter({'b@sample.com': 3, 'c@sample.com': 2, 'a@sample.com': 1})

This is similar to the following plain-dict loop (the simplest way):

d = {}
for i in l:  # l = list of all emails
    if i in d:
        d[i] += 1
    else:
        d[i] = 1

or use dict.get:

for i in l:
    d[i] = d.get(i, 0) + 1
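Whichever way the counts are built, the dict keys are already the deduplicated addresses. A short sketch, assuming the list is called `emails` (an illustrative name) and using the sample data from the question, that prints the "address (count)" lines asked for:

```python
emails = [
    "a@sample.com",
    "b@sample.com", "b@sample.com", "b@sample.com",
    "c@sample.com", "c@sample.com",
]

# count occurrences with a plain dict and dict.get
counts = {}
for email in emails:
    counts[email] = counts.get(email, 0) + 1

# the dict keys are the unique addresses; sort them for alphabetical order
unique_emails = sorted(counts)

lines = ["{} ({})".format(email, counts[email]) for email in unique_emails]
print("\n".join(lines))
```

Here `unique_emails` is the list with one instance of each address, and `counts` holds how often each appeared in the original list.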

`d[i] = d.get(i, 0) + 1` is a better idiom. – Sriram Dec 02 '14 at 13:00

Hackaholic · Answer 2 · 2014-12-02T12:49:50.113

You can use collections.Counter:

>>> from collections import Counter
>>> my_email
['a@sample.com', 'b@sample.com', 'b@sample.com', 'b@sample.com', 'c@sample.com', 'c@sample.com']
>>> Counter(my_email)
Counter({'b@sample.com': 3, 'c@sample.com': 2, 'a@sample.com': 1})

If you want the result in order:

>>> sorted(Counter(my_email).items())
[('a@sample.com', 1), ('b@sample.com', 3), ('c@sample.com', 2)]

You can print it like this:

>>> for x in sorted(Counter(my_email).items()):
...     print x[0],x[1]   # if you are using Python 3: print(x[0], x[1])
... 
a@sample.com 1
b@sample.com 3
c@sample.com 2
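If the addresses were read from a file, a trailing newline or stray whitespace makes otherwise identical addresses count separately. A hedged sketch, assuming one address per line; `raw_lines` is a hypothetical stand-in for the lines of the input file:

```python
from collections import Counter

# stand-in for lines read from a file; note the trailing newlines
raw_lines = ["a@sample.com\n", "b@sample.com\n", "b@sample.com\n",
             "b@sample.com\n", "c@sample.com\n", "c@sample.com"]

# strip surrounding whitespace and skip blank lines before counting
cleaned = [line.strip() for line in raw_lines if line.strip()]
counts = Counter(cleaned)

for email, n in sorted(counts.items()):
    print("{} ({})".format(email, n))
```

Without the `strip()` call, the last `c@sample.com` would be counted as a different key from `c@sample.com\n`.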

edited Dec 02 '14 at 12:49

answered Dec 02 '14 at 12:32

Hackaholic


A subsequent sort is required. However, 8000 entries are nothing for the built-in sort. – Vladimir Dec 02 '14 at 12:34
If you're going to sort based on keys then the `key` call is useless, a dict always has unique keys. – Ashwini Chaudhary Dec 02 '14 at 12:51
yea right, its useless – Hackaholic Dec 02 '14 at 12:53
