
I have a long list of email addresses (8000), sorted alphabetically, but it contains duplicates.

Using Python, how can I count how many times each unique email occurs and delete the recurring duplicates, keeping one instance of each email in the list?

Example list:

a@sample.com
b@sample.com
b@sample.com
b@sample.com
c@sample.com
c@sample.com

Desired results:

a@sample.com (1)
b@sample.com (3)
c@sample.com (2)

I've searched online, but only found methods for removing duplicate numbers, dictionaries and tuples.

Guage

2 Answers


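Use itertools.groupby() when the list is in sorted (alphabetical) order:

 >>> from itertools import groupby
 >>> l = ['a@sample.com', 'b@sample.com', 'b@sample.com', 'b@sample.com', 'c@sample.com', 'c@sample.com']
 >>> [(key, sum(1 for _ in group)) for key, group in groupby(sorted(l))]
 [('a@sample.com', 1), ('b@sample.com', 3), ('c@sample.com', 2)]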
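A minimal sketch of how the groupby result could be turned into both the de-duplicated list and the "address (count)" lines from the question. The names emails, unique_emails and report are illustrative, not from the original post:

```python
from itertools import groupby

emails = [
    "a@sample.com",
    "b@sample.com", "b@sample.com", "b@sample.com",
    "c@sample.com", "c@sample.com",
]

# groupby only merges adjacent equal items, so the input must be sorted.
counts = [(key, sum(1 for _ in group)) for key, group in groupby(sorted(emails))]

# One instance of each address, in alphabetical order.
unique_emails = [email for email, _ in counts]

# Lines in the "address (count)" format used in the question.
report = ["{} ({})".format(email, n) for email, n in counts]

for line in report:
    print(line)
```

If the list is already sorted, sorted() can be dropped; it is kept here so the sketch also works on unsorted input.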

Use collections.Counter to count how many times each item occurs:

>>> from collections import Counter
>>> d = Counter(['a@sample.com',
...              'b@sample.com',
...              'b@sample.com',
...              'b@sample.com',
...              'c@sample.com',
...              'c@sample.com'])
>>> d

Output:

Counter({'b@sample.com': 3, 'c@sample.com': 2, 'a@sample.com': 1})

In its simplest form, this is equivalent to:

d = {}
for i in l:  # l = list of all emails
    if i in d:
        d[i] += 1
    else:
        d[i] = 1

Or use dict.get:

for i in l:
    d[i] = d.get(i, 0) + 1
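Counting alone does not produce the de-duplicated list the question asks for. A minimal sketch of getting both at once, assuming Python 3.7+ (where dicts, and therefore Counter, keep keys in first-seen order); count_and_dedupe is a hypothetical helper name:

```python
from collections import Counter


def count_and_dedupe(emails):
    """Return (unique_emails, counts) for a list of address strings."""
    counts = Counter(emails)
    # Counter is a dict subclass; on Python 3.7+ iterating it yields each
    # key once, in the order it was first seen in the input.
    unique_emails = list(counts)
    return unique_emails, counts


emails = [
    "a@sample.com",
    "b@sample.com", "b@sample.com", "b@sample.com",
    "c@sample.com", "c@sample.com",
]

unique_emails, counts = count_and_dedupe(emails)
for email in unique_emails:
    print("{} ({})".format(email, counts[email]))
```

Because the input is already sorted alphabetically, the de-duplicated list comes out sorted as well.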

Vishnu Upadhyay

You can use collections.Counter:

>>> from collections import Counter
>>> my_email
['a@sample.com', 'b@sample.com', 'b@sample.com', 'b@sample.com', 'c@sample.com', 'c@sample.com']
>>> Counter(my_email)
Counter({'b@sample.com': 3, 'c@sample.com': 2, 'a@sample.com': 1})

If you want the results in sorted order:

>>> sorted(Counter(my_email).items())
[('a@sample.com', 1), ('b@sample.com', 3), ('c@sample.com', 2)]

You can print them like this:

>>> for x in sorted(Counter(my_email).items()):
...     print(x[0], x[1])   # in Python 2: print x[0], x[1]
... 
a@sample.com 1
b@sample.com 3
c@sample.com 2
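
If the addresses are read from a text file with one address per line (an assumption; the question does not say where the list comes from), each line ends with a newline, and 'c@sample.com' and 'c@sample.com\n' would be counted as different strings. A minimal sketch that strips whitespace before counting, using io.StringIO as a stand-in for a real file:

```python
import io
from collections import Counter

# Stand-in for an opened file; a real script would use open() on its own path.
fake_file = io.StringIO(
    "a@sample.com\nb@sample.com\nb@sample.com\n\nc@sample.com\nc@sample.com\n"
)

# strip() removes the trailing newline; the `if` skips blank lines.
emails = [line.strip() for line in fake_file if line.strip()]
counts = Counter(emails)

for email, n in sorted(counts.items()):
    print(email, n)
```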
Hackaholic