3

I have 2 lists of dictionaries. List A is 34,000 long, list B is 650,000 long. I am essentially inserting all the List B dicts into the List A dicts based on a key match. Currently, I am doing the obvious, but its taking forever (seriously, like a day). There must be a quicker way!

for a in listA:
    a['things'] = []
    for b in listB:
        if a['ID'] == b['ID']:
            a['things'].append(b)
MFB
  • 19,017
  • 27
  • 72
  • 118
  • Did you check out this question: http://stackoverflow.com/questions/38987/how-can-i-merge-two-python-dictionaries-as-a-single-expression - I'm not sure whether you're trying to overwrite the contents of listA with listB but if so, that's your answer. – mal-wan Sep 06 '11 at 23:37
  • Thanks mwan100 but I'm inserting listB items into listA items, not overwriting. Also, that poster admits its not a great solution if performance is an issue – MFB Sep 06 '11 at 23:39
  • Are the values of a["ID"] unique? – DSM Sep 06 '11 at 23:44
  • DSM, only on the A side. The dicts are from a db where a['ID'] is primary key. Many B's for One A. – MFB Sep 06 '11 at 23:47
  • Have a look here - http://wiki.python.org/moin/PythonSpeed/PerformanceTips. The section on 'Avoiding dots' and 'Local variables' might help. Along with the answer below. – arunkumar Sep 07 '11 at 00:01
  • Thanks arunkumar.. helpful link – MFB Sep 07 '11 at 00:29

3 Answers3

4
from collections import defaultdict
dictB = defaultdict(list)
for b in listB:
    dictB[b['ID']].append(b)

for a in listA:
    a['things'] = []
    for b in dictB[a['ID']]:
        a['things'].append(b)

this will turn your algorithm from O(n*m) to O(m)+O(n), where n=len(listA), m=len(listB)

basically it avoids looping through each dict in listB for each dict in listA by 'precalculating' what dicts from listB match each 'ID'

agf
  • 171,228
  • 44
  • 289
  • 238
Mihai Stan
  • 1,052
  • 6
  • 7
  • dictB.setdefault([]) raises TypeError: unhashable type: 'list' – MFB Sep 07 '11 at 00:54
  • @MFB It's `setdefault('whateverthekeyis', [])`. – agf Sep 07 '11 at 02:10
  • @MFB actually it looks what he means here is `from collections import defaultdict; dictB = defaultdict(list)`. I'm editing the answer to reflect this. `setdefault` needs to be used every time you do a key lookup, so it would be i.e. `dictB.setdefault(b['ID'], []).append(b)`. – agf Sep 07 '11 at 03:27
1

Here's an approach that may help. I'll leave it to you to fill in the details.

Your code is slow because it is a O(n^2) algorithm, comparing every A against every B.

If you sort each of listA and listB by id first (these are O(nlogn)) operations, then you can iterate easily through the sorted versions of A and B (this will be in linear time).

This approach is common when you have to do external merges on very large data sets. Mihai's answer is better for internal merging, where you simply index everything by id (in memory). If you have the memory to hold these additional structures, and dictionary lookup is constant time, that approach will likely be faster, not to mention simpler. :)

By way of example let's say A had the following ids after sorting

acfgjp

and B had these ids, again after sorting

aaaabbbbcccddeeeefffggiikknnnnppppqqqrrr

The idea is, strangely enough, to keep indexes into A and B (I know that does not sound very Pythonic). At first you are looking at a in A and a in B. So you walk through B adding all the a's to your "things" array for a. Once you exhaust the a's in B, you move up one in A, to c. But the next item in B is b, which is less than c, so you have to skip the b's. Then you arrive at a c in B, so you can start adding into "things" for c. Continue in this fashion until both lists are exhausted. Just one pass. :)

Ray Toal
  • 86,166
  • 18
  • 182
  • 232
  • Ray, thanks for the excellent response. Can I clarify something? Once I have each list sorted by ID, do I revert to my method, or employ something else? – MFB Sep 07 '11 at 00:29
0

I'd convert ListA and ListB into dictionaries instead, dictionaries with ID as the key. Then it is a simple matter to append data using python's quick dictionary lookups:

from collections import defaultdict

class thingdict(dict):
    def __init__(self, *args, **kwargs):
        things = []
        super(thingdict,self).__init__(*args, things=things, **kwargs)

A = defaultdict(thingdict)
A[1] = defaultdict(list)
A[2] = defaultdict(list, things=[6])  # with some dummy data
A[3] = defaultdict(list, things=[7])

B = {1: 5, 2: 6, 3: 7, 4: 8, 5: 9}

for k, v in B.items():
    # print k,v
    A[k]['things'].append(v)

print A
print B

This returns:

defaultdict(<class '__main__.thingdict'>, {
    1: defaultdict(<type 'list'>, {'things': [5]}),
    2: defaultdict(<type 'list'>, {'things': [6, 6]}),
    3: defaultdict(<type 'list'>, {'things': [7, 7]}),
    4: {'things': [8]},
    5: {'things': [9]}
})
{1: 5, 2: 6, 3: 7, 4: 8, 5: 9}
Gringo Suave
  • 29,931
  • 6
  • 88
  • 75