
I have a multi-column (13 columns), space-separated file (some 5 million+ lines), going like this:

 1. W5 403  407 P Y 2 2 PR 22  PNIYR 22222 12.753 13.247
 2. W5 404  408 N V 2 2 PR 22  PIYYR 22222 13.216 13.247
 3. W3 274  276 E G 1 1 EG 11  EPG 121 6.492 6.492
 4. W3 275  277 P R 2 1 PR 21  PGR 211 6.365 7.503
 5. W3 276  278 G Y 1 1 GY 11  GRY 111 5.479 5.479
 6. W3 46  49 G L 1 1 GY 11  GRY 111 5.176 5.176
 7. W4 47  50 D K 1 1 DK 11  DILK 1111 4.893 5.278
 8. W4 48  51 I K 1 1 IK 11  ILKK 1111 4.985 5.552

etc., etc.,

I'm interested in 2 of these columns (columns 8 and 11) and want to count the number of occurrences of each particular pair: the key (column 8) together with the string that follows it (column 11).

Ex., for reference:

    key 'GY' : # of occurrences of '111' : 2
    key 'PR' : # of occurrences of '22222' : 2
    key 'DK' : # of occurrences of '1111' : 1
    key 'EG' : # of occurrences of '121' : 1

I have a basic dict-based implementation of it:

countshash = {}
for l in bigtable:
    cont = l.split()
    if cont[7] not in countshash:
        countshash[cont[7]] = {}
    if cont[10] not in countshash[cont[7]]:
        countshash[cont[7]][cont[10]] = 0
    countshash[cont[7]][cont[10]] += 1

I also have a simple awk-based count (which is super fast), but I was wondering about an efficient and faster way to do this in Python. Thanks for your inputs.

nahsivar
  • What's the issue with your dict-based implementation? – tMC Aug 21 '12 at 20:55
  • @tMC: user17177 is asking if there is any faster way of doing it. – Blender Aug 21 '12 at 20:57
  • @user17177: Can you post a sample line from `bigtable` (like an actual Python object)? – Blender Aug 21 '12 at 21:05
  • Related: [Python - Is a dictionary slow to find frequency of each character?](http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character) – jfs Aug 21 '12 at 21:22
  • @Blender: the above table itself is a sample. If you want, I can post more lines of it, as an object. – nahsivar Aug 22 '12 at 07:07

4 Answers


I'm not sure if this will help with speed, but you are essentially hand-rolling a ton of defaultdict-like objects, which I think you can make a bit more readable:

from collections import defaultdict

countshash = defaultdict(lambda: defaultdict(int))

for l in bigtable:
    cont = l.split()
    countshash[cont[7]][cont[10]] += 1
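
For example, reading the nested counts back out might look like this (a small sketch; the 'GY'/'111' pair comes from the sample rows above, which contain it twice):

print countshash['GY']['111']   # -> 2 for the sample data

# Walk the whole structure: outer key, inner string, count
for key, counts in countshash.items():
    for s, n in counts.items():
        print key, s, n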
Blender
from collections import Counter
Counter(tuple(row.split()[7:11:3]) for row in bigtable)

Using `itemgetter` is more flexible and may be more efficient than slicing:

from operator import itemgetter
ig = itemgetter(7, 10)
Counter(ig(row.split()) for row in bigtable)

Using `imap` can make things a tiny bit faster too (note that `itertools.imap` is Python 2; in Python 3 the built-in `map` is already lazy):

from itertools import imap
Counter(imap(ig, imap(str.split, bigtable)))
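
Whichever variant you use, the result is a Counter keyed by (column 8, column 11) tuples, so lookups and "most frequent" queries come for free (a small sketch; the 'GY'/'111' pair is from the sample rows above):

counts = Counter(imap(ig, imap(str.split, bigtable)))
print counts[('GY', '111')]   # -> 2 for the sample data
print counts.most_common(3)   # the three most frequent (key, string) pairs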
John La Rooy

Well, you are doing a double lookup. You could just do `countshash[(cont[7], cont[10])] += 1`; this could be faster, but that depends on how Python implements its dicts. The memory footprint should be slightly bigger.

Something simple like:

from collections import defaultdict

countshash = defaultdict(int)
for l in bigtable:
    cont = l.split()
    countshash[(cont[7], cont[10])] += 1
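
Reading a count back out is then a single lookup with the same tuple key (a small sketch; the pair values come from the sample rows above):

print countshash[('GY', '111')]   # -> 2 for the sample data

# Unpack the tuple key while iterating
for (key, s), n in countshash.items():
    print key, s, n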
offlinehacker
  • And as long as you can guarantee that you can avoid name confusion (e.g. never having PR1/111 and PR/1111). – Silas Ray Aug 21 '12 at 21:06
  • Agree, but you can still change the order if it solves the problem. – offlinehacker Aug 21 '12 at 21:10
  • Or you could just drop the strings into a tuple and use that as the dict key instead of concatenating the strings. That would probably make your life easier if/when you want to read data back out of this structure, too. – Silas Ray Aug 21 '12 at 21:12
  • By the way, [here](http://www.laurentluce.com/posts/python-dictionary-implementation/) is an article on how Python dicts are implemented. – offlinehacker Aug 21 '12 at 22:07
  • Interesting. Is there a way to optimize performance by explicitly setting the start size of the hash array? It seems you would need a custom cType for that... – Silas Ray Aug 21 '12 at 22:25
  • `d={}; d['a'] += 1` raises KeyError. You could use [`defaultdict`](http://stackoverflow.com/a/12063356/4279) to fix it. Also `()` are unnecessary to create a 2-item tuple. – jfs Aug 21 '12 at 23:46
  • @offlinehacker, as pointed out, the catch in the above code is that the dict needs to be initialized with keys first. defaultdict works well. – nahsivar Aug 22 '12 at 07:11
from collections import defaultdict

countshash = defaultdict(int)
for l in bigtable:
    cont = l.split()
    countshash[cont[7], cont[10]] += 1
jfs
  • Based on the simple tests I did, the speed of this seems comparable to the basic dict-based method, whereas the lambda implementation above is a tad faster. It needs more thorough profiling, though. – nahsivar Aug 22 '12 at 08:14
  • @user17177: It is surprising. This is why profiling with your own data, in your environment, is useful. – jfs Aug 22 '12 at 11:55
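
To follow up on the profiling point in the last comment: a minimal timeit harness for comparing the variants on your own data might look like this (a sketch; 'bigfile.txt' and the 100,000-line slice are placeholder choices, not from the question):

import timeit
from collections import defaultdict

# Load a fixed slice of the real file once so every variant is timed
# against identical input ('bigfile.txt' is a placeholder name).
with open('bigfile.txt') as f:
    lines = f.readlines()[:100000]

def tuple_key():
    counts = defaultdict(int)
    for l in lines:
        cont = l.split()
        counts[cont[7], cont[10]] += 1
    return counts

# Time 10 runs; wrap the other implementations the same way to compare.
print timeit.timeit(tuple_key, number=10)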