1

I have to compare two lists of dicts such as below:

main = [{'id': 1,'rate': 13,'type'= 'C'}, {'id': 2,'rate': 39,'type': 'A'}, ...]
compare = [{'id': 119, 'rate': 33, 'type': 'D'}, {'id': 120, 'rate': 94, 'type': 'A'}, ...]

for m in main:
  for c in compare:
     if (m['rate'] > c['rate']) and (m['type'] == c['type']):
          # ...

The lists have around 9,000 items. The above code runs around 81,000,000 times (9,000 * 9,000). How can I speed this up?

Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
devnext
  • 872
  • 1
  • 10
  • 25
  • 1
    its late on a Friday so Im just going to throw this out there, Lambdas is what you want. – DUDANF Sep 20 '19 at 16:00
  • If you sort `compare` by rate, you can short-circuit the inner loop. – Scott Hunter Sep 20 '19 at 16:03
  • Do you _want_ to compare each element in `main` to each element in `compare`? – John Gordon Sep 20 '19 at 16:05
  • 1
    There is no algorithmic optimization. You can only use 'syntax sugar' or optimize it some different way such as lambdas – Pavel Antspovich Sep 20 '19 at 16:06
  • Of course it runs 81M times. You should not worry about CPU cycles. Only worry about time spent. Do the 81M loops run in 0.01 secs? No problem. Does it take 2 days? Problem. – Thomas Weller Sep 20 '19 at 16:09
  • @PavelAntspovich: well, maybe that statement is too simple to be true. It could be more efficient to first group all items by type and then compare only the ones where the type matches. In addition, sorting by rate might help as well. – Thomas Weller Sep 20 '19 at 16:12
  • @ThomasWeller, I doubt very much that sorting expenses will cover comparison ones. And there are minimum two nested for loops and one if statement – Pavel Antspovich Sep 20 '19 at 16:17

4 Answers4

3

You could first sort or split the lists by type and perform the comparisons per type only. The question then is: how many operations do you need for sorting or splitting and how many for comparison. Remember that there are quite efficient sort algorithms.

The next optimizazion could be sorting by rate. That way you can break the loop when the condition m['rate'] > c['rate'] is not satisfied any more. In fact, you can even do a command and conquer algorithm.

Last not least, you might benefit from Why is processing a sorted array faster than processing an unsorted array?, which is not an algorithmic improvement, but can still make a huge difference.

Let me generate a dataset with 9000 items (in the future, you may want to include such a thing in your question, since it makes our life easier):

import random
types = ["A", "B", "C", "D", "E", "F"]
main=[]
compare = []
for i in range(9000):
    main.append({'id':random.randint(0,20000), 'rate':random.random()*500, 'type':types[random.randint(0,5)]})
    compare.append({'id': random.randint(0, 20000), 'rate': random.random() * 500, 'type': types[random.randint(0, 5)]})

Running this with a loop like

import time
start = time.time()
cycles = 0
for m in main:
  for c in compare:
      cycles += 1
      if (m['rate'] > c['rate']) and (m['type'] == c['type']):
          pass
end = time.time()
print("Total number of cycles "+str(cycles))
print("Seconds taken: " + str(end - start))

it results (on my machine) in 81M cycles and ~30 seconds.

Splitting by type might look like this:

# Split by types
mainsplit = {}
compsplit = {}
for t in types:
    cycles += 1
    mainsplit[t] = []
    compsplit[t] = []
for m in main:
    cycles += 1
    mainsplit[m["type"]].append(m)
for c in compare:
    cycles += 1
    compsplit[c["type"]].append(c)

# Then go through it by type
for t in types:
    for m in mainsplit[t]:
        for c in compsplit[t]:
            cycles += 1
            if m['rate'] > c['rate']:
                pass

This gives ~14M cycles and only ~4 s.

Sorting the partial results by "rate" and finding a lower limit for "rate" :

# Then go through it by type
for t in types:
    mainsplit[t].sort(key=lambda i:i["rate"])
    compsplit[t].sort(key=lambda i:i["rate"])
    start_of_m_in_c = 0
    for m in mainsplit[t]:
        for nc in range(start_of_m_in_c, len(compsplit[t])):
            cycles += 1
            if m["rate"] > compsplit[t][nc]["rate"]:
                pass
            else:
                start_of_m_in_c = nc

Cycles is now 36000 (not counting the cycles used by the sort algorithm) and the time to 30 ms.

All in all, that's a performance increase of a factor 1000.

Thomas Weller
  • 55,411
  • 20
  • 125
  • 222
1

Given:

main = [
    {'id': 1, 'rate': 13, 'type': 'C'},
    {'id': 2, 'rate': 39, 'type': 'A'},
    {'id': 3, 'rate': 94, 'type': 'A'},
    {'id': 4, 'rate': 95, 'type': 'A'},
    {'id': 5, 'rate': 96, 'type': 'A'}
]
compare = [
    {'id': 119, 'rate': 33, 'type': 'D'},
    {'id': 120, 'rate': 94, 'type': 'A'}
]

You can first map the two lists of dicts into two dicts of lists of dicts indexed by type, and sort sub-lists by rate:

mappings = []
for lst in main, compare:
    mappings.append({})
    for entry in lst:
        mappings[-1].setdefault(entry['type'], []).append(entry)
    for entries in mappings[-1].values():
        entries.sort(key=lambda entry: entry['rate'])
main, compare = mappings

so that main becomes:

{'C': [{'id': 1, 'rate': 13, 'type': 'C'}],
 'A': [{'id': 2, 'rate': 39, 'type': 'A'},
       {'id': 3, 'rate': 94, 'type': 'A'},
       {'id': 4, 'rate': 95, 'type': 'A'},
       {'id': 5, 'rate': 96, 'type': 'A'}]}

while compare becomes:

{'D': [{'id': 119, 'rate': 33, 'type': 'D'}],
 'A': [{'id': 120, 'rate': 94, 'type': 'A'}]}

so that you iterate through the matching types of the two dicts in linear time, and use bisect to find the index in each sub-list of main where the rate is greater than that of compare, which takes a time complexity of O(log n), and then iterate through the rest of the sub-list from that index for processing. Overall this algorithm is of O(n log n) in time complexity, an improvement over the O(n ^ 2) time complexity of your original code:

from bisect import bisect

for type in main.keys() & compare.keys():
    for entry in compare[type]:
        main_entries = main[type]
        for match in main_entries[bisect([d['rate'] for d in main_entries], entry['rate']):]:
            print(match['id'], entry['id'])

This outputs:

4 120
5 120

Demo: https://repl.it/repls/EasygoingReadyTechnologies

Disclaimer: This may look like an implementation of @ThomasWeller's solution but I actually did not see his answer until I finished my coding, which was interrupted by my other work. Also @ThomasWeller wants to sort the two lists by type, which would incur an O(n log n) time complexity, when it can be done in linear time as shown in the for entry in lst loop in my code.

blhsing
  • 91,368
  • 6
  • 71
  • 106
0

This looks like a job for sqlite - it's the kind of thing databases are totally optimized for. Python has very nice bindings to sqlite, so it should fit nicely.

Here's a starting point...

import sqlite3

c = None
try:
    c = sqlite3.connect(':memory:')
    c.execute('create table main ( id integer primary key, rate integer not null,   type text not null );')
    main = [{'id': 1,'rate': 13,'type': 'C'}, {'id': 2,'rate': 39,'type': 'A'}]
    for e in main:
        c.execute('insert into main (id, rate, type) VALUES (' + str(e['id']) + ',  ' +
                    str(e['rate']) + ',\"' + e['type'] + '\")')
    # now for the query
    # exercise left for the OP (but does require some SQL expertise)
except Error as e:
    print(e)
finally:
    if c:
        c.close()
Dylan McNamee
  • 1,696
  • 14
  • 16
0

You can use PyPy interpretator instead of classic Cpython. It can give you abaout 80% speedup

Pavel Antspovich
  • 1,111
  • 1
  • 11
  • 27