Similar questions have been asked by others (for example, here), but they differed slightly and didn't really solve my problem, so here I go again.
I have N lists (N > 20,000), and each list contains M lists (M > 20,000), in the following manner (the data is dummy):
Key1: [ [4,3,1], [5,1,0] ...... [43,21,0 ] ] # List 1 with collection of M smaller lists
:
:
KeyN: [ [5,4,1], [55,1,1] ...... [ 221, 0, 0] ] # Nth list
The data is unsorted. Iterating over a list of threshold values one by one, say Threshold = [2, 3, 5, 7, 8], where the threshold is applied to the middle element, I want to extract, for every key, all the elements greater than the threshold value. For example, going by the data above, Threshold = 2 would yield
For Key1: [ [4,3,1], [43,21,0]]
:
:
For KeyN: [[5,4,1]]
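To make the desired behaviour concrete, a single threshold pass without any sorting can be written as a per-key list comprehension (a minimal sketch over the dummy data above; the dict and key names are placeholders):

```python
# Sketch of one threshold pass over the dummy data (names are placeholders).
data = {
    "Key1": [[4, 3, 1], [5, 1, 0], [43, 21, 0]],
    "KeyN": [[5, 4, 1], [55, 1, 1], [221, 0, 0]],
}

threshold = 2
# Keep every triple whose middle element exceeds the threshold -- one O(M) scan per key.
result = {key: [t for t in triples if t[1] > threshold]
          for key, triples in data.items()}

print(result["Key1"])  # [[4, 3, 1], [43, 21, 0]]
print(result["KeyN"])  # [[5, 4, 1]]
```

This is O(N*M) per threshold value, repeated for every entry in the threshold list.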
And similarly for the other threshold values. Since there are so many lists, my observation is that sorting contributes a lot of overhead, and hence I want to avoid it. What is the optimal method of doing this in Python? One additional important point: I am constructing the data myself, so possibly there is a better data structure to store it in to begin with. I am currently storing the data as PersistentLists within a BTree container in ZODB, which was suggested here. The following is a snippet of the code used for it:
for Gnodes in G.nodes():                  # Gnodes iterates over N values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():              # Hnodes iterates over M values
        Hvalue = someoperation(Hnodes, Gnodes)
        score = SomeOperation(Gvalue, Hvalue)
        btree_container.setdefault(Gnodes, PersistentList()).append([Hnodes, score, -1])
    transaction.savepoint(True)
transaction.commit()
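For comparison, here is how a threshold pass would look if each key's M triples were kept as an (M, 3) NumPy array instead of a PersistentList (a sketch; the array layout is my assumption, not what I currently have running):

```python
import numpy as np

# Sketch: each key maps to an (M, 3) array; the middle column holds the score.
data = {
    "Key1": np.array([[4, 3, 1], [5, 1, 0], [43, 21, 0]]),
    "KeyN": np.array([[5, 4, 1], [55, 1, 1], [221, 0, 0]]),
}

thresholds = [2, 3, 5, 7, 8]
for threshold in thresholds:
    # A boolean mask on the middle column selects qualifying rows in one
    # vectorized step, instead of a Python-level loop over M triples.
    filtered = {key: arr[arr[:, 1] > threshold] for key, arr in data.items()}
```

After the last iteration (threshold = 8), only Key1's [43, 21, 0] row survives in this dummy data.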
Any suggestions on the most efficient way of doing this? Is sorting first indeed the optimal approach?
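For context, the sort-first alternative I am weighing would look roughly like this (a sketch: sort each key's list once by the middle element, then answer each threshold query with a binary search instead of a full scan):

```python
import bisect

# Placeholder data; in reality there are N keys with M triples each.
data = {"Key1": [[4, 3, 1], [5, 1, 0], [43, 21, 0]]}

# One-time O(M log M) cost per key: sort by the middle element and precompute
# the middle-element column, so each query is an O(log M) bisect plus a slice.
prepared = {}
for key, triples in data.items():
    triples_sorted = sorted(triples, key=lambda t: t[1])
    prepared[key] = (triples_sorted, [t[1] for t in triples_sorted])

def above_threshold(key, threshold):
    """Return all triples for `key` whose middle element exceeds `threshold`."""
    triples_sorted, middles = prepared[key]
    idx = bisect.bisect_right(middles, threshold)
    return triples_sorted[idx:]

print(above_threshold("Key1", 2))  # [[4, 3, 1], [43, 21, 0]]
```

Whether the up-front sort pays off presumably depends on how many threshold values are queried per key.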