Quickest way to count number of unique values per key?

Question

I have a text file with two column name|familyname in which a name can have different family so, we have multiple rows with same name and different familynames. The file is around50GB. What I want is the number of familynames per name.

currently I created a dictionary in with names as keys and family name as values, and I am printing out each key and length of the value (as a set of family names). But this is not really efficient and quick

d = defaultdict(set)


f = open(file, 'r')
for n, line in enumerate(f):
    name,family= line.split('|')
    d[name].add(family)


for name, family in d.iteritems():
    print("%s|%s" % (name, len(family)), file = w)

Does any body have any suggestion for a quicker method of getting the same result?

50gb of text? How many lines does this file has? Just curious. — DevLounge, Oct 29 '14 at 21:22
yeah I know but it would take a lot of time also and how is that useful? — UserYmY, Oct 29 '14 at 21:28
Anyway, unless you inject such file into a database, you'll have to go through each line and then do as you do. It will takes ages as the file is too huge for optimal computation. IMO — DevLounge, Oct 29 '14 at 21:29
http://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line — DevLounge, Oct 29 '14 at 21:33
Reading 50GB of text is going to take a while. You're very likely I/O blocked more than anything else; there's very little you can do algorithmically that will alleviate this. If you have a lot of unique names, you're also going to find that printing to the screen is abominably slow--try piping the output to a file to speed up the output reporting. — Henry Keiter, Oct 29 '14 at 22:10
Do you know that there are no identical lines? If there are identical lines, how do you want them counted? — Robᵩ, Oct 30 '14 at 01:15
Why do you call `enumerate`? Wouldn't `for line in f:` suffice? — Robᵩ, Oct 30 '14 at 01:16
[This question](http://stackoverflow.com/questions/26422507/count-unique-values-per-unique-keys-in-python-dictionary?rq=1)'s accepted answer is identical to what you've started with. — Robᵩ, Oct 30 '14 at 01:28
no there are no identical lines. I already preprocessed that. — UserYmY, Oct 30 '14 at 07:18

score 0 · Answer 1 · answered Oct 30 '14 at 01:22

0

One alternative is to use collections.Counter. This will double-count identical lines, but it might go faster:

import collections

with open('input.txt', 'r') as f:
    d = collections.Counter(line.split('|',1)[0] for line in f)

print d.most_common(5)

answered Oct 30 '14 at 01:22

Robᵩ

163,533
20
239
308

This does not work . I get this error:AttributeError: 'module' object has no attribute 'counter' – UserYmY Oct 30 '14 at 09:05
1) Are you using Python version 2.7 or higher? `collections.Counter` was introduced in 2.7. 2) Are you spelling `Counter` correctly? It starts with a capital `C`, and your error message has a lower-case `c`. – Robᵩ Oct 30 '14 at 12:32

Quickest way to count number of unique values per key?

1 Answers1