0

I have a text file with two column name|familyname in which a name can have different family so, we have multiple rows with same name and different familynames. The file is around50GB. What I want is the number of familynames per name.

currently I created a dictionary in with names as keys and family name as values, and I am printing out each key and length of the value (as a set of family names). But this is not really efficient and quick

d = defaultdict(set)


f = open(file, 'r')
for n, line in enumerate(f):
    name,family= line.split('|')
    d[name].add(family)


for name, family in d.iteritems():
    print("%s|%s" % (name, len(family)), file = w)

Does any body have any suggestion for a quicker method of getting the same result?

UserYmY
  • 8,034
  • 17
  • 57
  • 71
  • 50gb of text? How many lines does this file has? Just curious. – DevLounge Oct 29 '14 at 21:22
  • In your shell: wc -l – DevLounge Oct 29 '14 at 21:25
  • yeah I know but it would take a lot of time also and how is that useful? – UserYmY Oct 29 '14 at 21:28
  • Anyway, unless you inject such file into a database, you'll have to go through each line and then do as you do. It will takes ages as the file is too huge for optimal computation. IMO – DevLounge Oct 29 '14 at 21:29
  • http://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line – DevLounge Oct 29 '14 at 21:33
  • Reading 50GB of text is going to take a while. You're very likely I/O blocked more than anything else; there's very little you can do algorithmically that will alleviate this. If you have a lot of unique names, you're also going to find that printing to the screen is abominably slow--try piping the output to a file to speed up the output reporting. – Henry Keiter Oct 29 '14 at 22:10
  • Do you know that there are no identical lines? If there are identical lines, how do you want them counted? – Robᵩ Oct 30 '14 at 01:15
  • Why do you call `enumerate`? Wouldn't `for line in f:` suffice? – Robᵩ Oct 30 '14 at 01:16
  • [This question](http://stackoverflow.com/questions/26422507/count-unique-values-per-unique-keys-in-python-dictionary?rq=1)'s accepted answer is identical to what you've started with. – Robᵩ Oct 30 '14 at 01:28
  • no there are no identical lines. I already preprocessed that. – UserYmY Oct 30 '14 at 07:18

1 Answers1

0

One alternative is to use collections.Counter. This will double-count identical lines, but it might go faster:

import collections

with open('input.txt', 'r') as f:
    d = collections.Counter(line.split('|',1)[0] for line in f)

print d.most_common(5)
Robᵩ
  • 163,533
  • 20
  • 239
  • 308
  • This does not work . I get this error:AttributeError: 'module' object has no attribute 'counter' – UserYmY Oct 30 '14 at 09:05
  • 1) Are you using Python version 2.7 or higher? `collections.Counter` was introduced in 2.7. 2) Are you spelling `Counter` correctly? It starts with a capital `C`, and your error message has a lower-case `c`. – Robᵩ Oct 30 '14 at 12:32