I have a list of lists of tuples. Each tuple has the form (string,int)
, e.g.
lst = list()
lst.append([('a',5),('c',10),('d',3),('b',1)])
lst.append([('c',14),('f',4),('b',1)])
lst.append([('d',22),('f',2)])
Think of the int
's as counts of each string in different blocks of text.
What I need to do is produce a list of top-N
occurring strings together with their cumulative counts. So in the example above, a
appears 5 times, b
appears twice, c
appears 24 times etc. If N=2
, then I would have to produce either a pair of parallel lists ['d','c']
and [25,24]
or a list of tuples [('d',25),('c',24)]
. I need to do it as quickly as possible. My machine has lots of RAM so memory is not an issue.
I have this implementation:
import numpy as np
def getTopN(lst,N):
sOut = []
cOut = []
for l in lst:
for tpl in l:
s = tpl[0]
c = tpl[1]
try:
i = sOut.index(s)
cOut[i] += c
except:
sOut.append(s)
cOut.append(c)
sIndAsc = np.argsort(cOut).tolist()
sIndDes = sIndAsc[::-1]
cOutDes = [cOut[sir] for sir in sIndDes]
sOutDes = [sOut[sir] for sir in sIndDes]
return sOutDes[0:N],cOutDes[0:N]
There's gotta be a better way, but what would it be?