
I'm experiencing serious slowdowns when working with dictionaries, as the dictionary grows to a few thousand keys.

I'm processing a file with ~1,000,000 lines of data, and I'm constructing a graph-like data structure using dictionaries.

Here's my bottleneck function:

def create_edge(node_a, node_b, graph):
    if node_a not in graph.keys():
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a].keys():
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1

create_edge will create an edge from node_a to node_b, or add 1 to the weight of an already existing edge between them.

Since my nodes are identified by unique string IDs, I'm using dictionaries for storage, assuming that checking whether a key exists and inserting both take O(1) on average.

If I comment out create_edge I can process around 20,000 records per second; with create_edge as part of my pipeline it's about 20 records per second.

The first 100 records take about 500 ms to process. When the dictionary size grows to around 10,000, processing 100 records takes about 15,000 ms. Each record invokes create_edge about 4 times on average, so 400 calls to create_edge take 15 seconds when the dictionary size is 10,000.

First, do these runtimes make sense? They seem way too high to me; correct me if I'm wrong.

Second, suggestions for optimizing the dictionary usage for better run time would be greatly appreciated.

I expect the dictionary to grow to at least 100,000 keys by the time all 1,000,000 records are processed.


Edit: Conclusions

You were right on the money; I made two rookie mistakes here. :)

The keys() calls increased the complexity dramatically, taking each edge insertion from constant time to linear time (quadratic over the whole run). Replacing if node in graph.keys() with if node in graph produces a constant processing time of ~300 ms per 100 records.

The second mistake was a virtualenv config that led me to believe I was using Python 3 while I was actually using Python 2.

Python 3 does optimize the keys() call into a constant-time membership test (keys() returns a view rather than a list), which is good for run time but still poor style.
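The difference can be seen with a quick micro-benchmark (my own sketch, not part of the original post):

```python
import timeit

# Hypothetical micro-benchmark: time membership tests on a dict with
# 100,000 string keys (the sizes are illustrative, not from the post).
d = {str(i): i for i in range(100_000)}

t_direct = timeit.timeit("'99999' in d", globals={"d": d}, number=100_000)
t_keys = timeit.timeit("'99999' in d.keys()", globals={"d": d}, number=100_000)

# On Python 3 both tests are O(1); d.keys() returns a view, adding only a
# small constant overhead per call. On Python 2, d.keys() built a full list
# on every call, making the same test O(n) and the whole loop quadratic.
print("key in d:        %.3fs" % t_direct)
print("key in d.keys(): %.3fs" % t_keys)
```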

Thanks a lot for your assistance.


I performed a run-time comparison after removing the keys() calls.

# graph = {}
python version: 3.6.3
start time 11:44:56
Number of records: 1029493
graph created, 1231630 nodes
end time 11:50:35
total ~05:39

# graph = defaultdict(lambda : defaultdict(int))
python version: 3.6.3
start time 11:54:52
Number of records: 1029493
graph created, 1231630 nodes
end time  12:00:34
total ~05:42

# graph = {}
python version: 2.7.10
start time 12:03:25
Number of records: 1029493
graph created, 1231630 nodes
end time 12:09:40
total ~06:15
Aviran

3 Answers


Can't you just use a defaultdict with defaultdict(int), as seen here: Python: defaultdict of defaultdict?

from collections import defaultdict

graph = defaultdict(lambda : defaultdict(int))

graph['a']['b'] += 1
graph['a']['b'] += 1
graph['a']['c'] += 1

graph

Returns:

defaultdict(<function __main__.<lambda>>,
            {'a': defaultdict(int, {'b': 2, 'c': 1})})
# equal to: {'a': {'b': 2, 'c': 1}}
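A caveat worth noting (my addition, not part of the original answer): a defaultdict silently creates entries on mere lookup, so it can be useful to convert the result back to plain dicts once the graph is built:

```python
from collections import defaultdict

graph = defaultdict(lambda: defaultdict(int))
graph['a']['b'] += 1
graph['a']['b'] += 1
graph['a']['c'] += 1

# Merely reading a missing key creates it, which can corrupt the graph:
_ = graph['typo']          # 'typo' now exists as an empty inner defaultdict
del graph['typo']

# Converting to plain dicts freezes the structure (and makes it JSON-friendly):
plain = {node: dict(edges) for node, edges in graph.items()}
print(plain)  # {'a': {'b': 2, 'c': 1}}
```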
Anton vBR
  • Is that actually faster than the OP's (after fixing their `.keys()` mistake, of course)? Can you share your benchmark results? – Stefan Pochmann Jan 01 '18 at 00:16
  • Since I’m not sitting on the data set, I thought OP could do it. – Anton vBR Jan 01 '18 at 00:20
  • @StefanPochmann It will likely be much faster. Context switching (calling `create_edge`) is pretty expensive. – Patrick Haugh Jan 01 '18 at 00:34
  • @PatrickHaugh I don't see a big difference, have a look at the times at the bottom: https://ideone.com/SajEsd. And that's with actually adding edges with that function. If I do it inline, the OP's way is significantly *faster* in my test: https://ideone.com/DvufOu – Stefan Pochmann Jan 01 '18 at 00:45
  • Will definitely try using lambdas and defaultdict and post results, thanks! – Aviran Jan 01 '18 at 07:05

When testing for the existence of a key in a dict, just use key in d, rather than key in d.keys(). Extracting the keys to test for membership negates the benefit of using a dict in the first place.

Try the following:

def create_edge(node_a, node_b, graph):
    if node_a not in graph:
        graph[node_a] = {node_b: 1}
    elif node_b in graph[node_a]:
        graph[node_a][node_b] += 1
    else:
        graph[node_a][node_b] = 1

Notice that keys() isn't called at all. This should be a lot faster than what you're doing now.

Note that in Python 2 the keys() check will be much slower than in Python 3, since in Python 2 keys() creates a list of the entire set of keys. It works differently in Python 3, but even in Python 3 checking for membership directly, without using keys(), will be faster.
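A quick way to see the Python 3 behavior (a sketch of my own, not from the answer): keys() returns a dynamic view, not a list:

```python
d = {'a': 1, 'b': 2}

# In Python 3, keys() returns a lightweight view object; membership tests
# on it hash the key just like 'key in d' does, so both are O(1).
view = d.keys()
print(type(view).__name__)  # dict_keys

# The view reflects later changes to the dict; no list is ever built.
d['c'] = 3
print('c' in view)  # True

# In Python 2, d.keys() returned a fresh list on every call, so
# 'key in d.keys()' was a linear scan -- the slowdown from the question.
```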

Tom Karzes
  • good explanation about the keys() part but without defaultdict/Counter it'll still be slow. – Jean-François Fabre Jan 01 '18 at 00:09
  • @Jean-FrançoisFabre A `defaultdict` will be faster, but the biggest problem in the code was that it was using `keys()`, especially if OP was using Python 2. Using a `defaultdict` might at best triple the speed by avoiding redundant key checks, but it does have the advantage that it eliminates the possibility of calling `keys()` by mistake. – Tom Karzes Jan 01 '18 at 00:16
  • @Jean-FrançoisFabre Please show your benchmark supporting that statement. – Stefan Pochmann Jan 01 '18 at 00:18
  • @Jean-FrançoisFabre Mine disagrees, defaultdict+Counter taking almost twice as long as the OP's (fixed) way: https://ideone.com/DvufOu – Stefan Pochmann Jan 01 '18 at 00:29
  • @StefanPochmann Interesting. I guess `defaultdict` is less efficient than I gave it credit for. – Tom Karzes Jan 01 '18 at 00:55
  • @TomKarzes Hmm, I just realized my test wasn't quite what it should be, since I used unique edges only. Here's a version where I have each edge ten times, and there the defaultdict[defaultdict[int]] wins (but defaultdict[Counter] is still the loser): https://ideone.com/ckF0X5 – Stefan Pochmann Jan 01 '18 at 01:07
  • @StefanPochmann Thank you for the tests! This confirms defaultdict(int) is the fastest approach in some cases and a big plus the lower amount of code :) – Anton vBR Jan 01 '18 at 09:40
  • @StefanPochmann hashing of the keys makes a difference here. With string keys, the hashing is more expensive, so `defaultdict` wins. – Jean-François Fabre Jan 01 '18 at 09:58
  • @Jean-FrançoisFabre In my latter test, `defaultdict` already won, so I guess you're talking about my former test? I changed that to use string keys, and `dict` still won: https://ideone.com/YLm53K Can you please share your test code? – Stefan Pochmann Jan 01 '18 at 11:12
  • @TomKarzes I'd btw replace "This should be a bit faster" with "This should be a lot faster". – Stefan Pochmann Jan 01 '18 at 11:20

I tried several methods, and this is one that seems to work. This method uses a Counter to count all occurrences first and then builds the dictionary. Thanks to @Stefan Pochmann for providing benchmark scripts; the one I used is from ideone.com/ckF0X5

I am using Python 3.6, and the results below are from my computer.

from timeit import timeit
from collections import defaultdict, Counter
from random import shuffle
from itertools import product

def f():   # OP's method modified with Tom Karzes' answer above.
    d = {}
    for i, j in edges:
        if i not in d:
            d[i] = {j: 1}
        elif j in d[i]:
            d[i][j] += 1
        else:
            d[i][j] = 1

def count_first(): 
    d = dict()
    for (v, w), c in Counter(edges).items():
        if v not in d:
            d[v] = {w: c}
        else:
            d[v][w] = c
    # Alternatively (thanks to Jean-François Fabre for pointing it out):
    # d = defaultdict(lambda : defaultdict(int)) 
    # for (v, w), c in Counter(edges).items(): 
    #     d[v][w] = c

edges = list(product(range(300), repeat=2)) * 10
shuffle(edges)

# %timeit f()
270 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# %timeit count_first()
180 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Disclaimer: the count_first() results I get from ideone.com are, however, slower than the OP's approach, f() here.

Stefan Pochmann did a benchmark experiment comparing different approaches in both Python 2 and 3. His Python 2 results can be found here; for Python 3, check this. Credit to him, and thanks for his code review.

Tai
  • I get slightly more speed with `d = defaultdict(lambda : defaultdict(int)) for v, c in Counter(edges).items(): d[v[0]][v[1]] = c` – Jean-François Fabre Jan 01 '18 at 09:01
  • Interesting idea, this pre-counting. Why not `c = Counter(edges)`, though? And yes, for my benchmarks I used Python 2 because that's what the OP was using (Ideone supports Python 3 as well). I just ran 1-dimensional tests with more methods. In Python 2, `defaultdict` won, followed by `dict`, and `Counter` was a lot slower: https://ideone.com/PFK7yX. In Python 3, the picture is similar, except giving the data to the `Counter` constructor went from very slow to by far fastest: https://ideone.com/nCNaxp. I wonder why... – Stefan Pochmann Jan 01 '18 at 11:36
  • Ah, right. That part of `Counter` got faster because it got implemented in C. – Stefan Pochmann Jan 01 '18 at 11:51
  • @Jean-FrançoisFabre They seem to be roughly the same speed (same mean and standard deviation on my machine), but the code is cleaner! – Tai Jan 01 '18 at 17:38
  • @StefanPochmann Thanks for the heads-up and benchmark experiments. Updated the code. – Tai Jan 01 '18 at 17:59
  • Could also do `for (v, w), c in ...` – Stefan Pochmann Jan 01 '18 at 19:19