I have a text file with two columns, name|familyname, in which one name can appear with several different family names, so there are multiple rows with the same name and different family names. The file is around 50 GB. What I want is the number of distinct family names per name.
Currently I build a dictionary with names as keys and sets of family names as values, then print each key and the size of its set. But this is neither efficient nor quick.
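from collections import defaultdict

d = defaultdict(set)  # name -> set of distinct family names
# `file` is the input path; `w` is an already-open output file
with open(file, 'r') as f:
    for line in f:
        # strip the newline so the last row compares equal to the others
        name, family = line.rstrip('\n').split('|')
        d[name].add(family)
for name, families in d.items():
    print("%s|%s" % (name, len(families)), file=w)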
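For comparison, here is a rough sketch of a lower-memory variant. It rests on an assumption that does not hold for the file as described: that all rows for a given name are adjacent, for example because the file was sorted by name beforehand. Under that assumption only one name's set needs to be in memory at a time. The function name `count_families_sorted` and the sample rows are made up for illustration.

```python
from itertools import groupby

def count_families_sorted(lines):
    """Yield (name, number of distinct family names) for each name.

    ASSUMPTION: all rows for a given name are adjacent in `lines`
    (e.g. the input was sorted by name first).
    """
    # Split each non-empty line into [name, family], dropping the newline.
    rows = (line.rstrip('\n').split('|', 1) for line in lines if line.strip())
    # groupby only merges consecutive rows that share the same name.
    for name, group in groupby(rows, key=lambda row: row[0]):
        yield name, len({family for _, family in group})

# Hypothetical sample data, already grouped by name.
sample = ["ann|smith\n", "ann|jones\n", "ann|smith\n", "bob|lee\n"]
for name, count in count_families_sorted(sample):
    print("%s|%s" % (name, count))
```

An open file object can be passed in place of `sample`, since the function only iterates over lines. Whether sorting 50 GB first is cheaper overall than the dictionary approach is not something this sketch answers.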
Does anybody have a suggestion for a quicker method of getting the same result?