1

In other languages, there are built-in data structures with binary search tree semantics, for example C++'s std::map.

Note that I am using C++ as an example, but not because I am attempting to draw a direct comparison between C++ itself and Python. I have received close votes on this question as "opinion based", but there is nothing "opinion based" about this question.

By binary search tree semantics I mean the following:

  • I can insert elements into the structure in sub-linear time, i.e. O(logn) for std::map
  • I can iterate over all of the elements in sorted order in linear time, i.e. O(n) for std::map

Is there any data structure in Python with binary search tree semantics?

To illustrate, can the following program be written efficiently with built-in Python data structures?

Imagine that I am a fund manager at a bank. Bank accounts are added to my fund very frequently. A bank account has an ID and a value. A smaller ID means that the bank account was created earlier than an account with a larger ID. Each time that an account is added to my fund, I would like to know the ID number of the 1000th oldest account in my fund, for my records.

Here are implementations in Python 3 and C++:

from random import randint

dist = lambda: randint(0, 2147483647)

def main():
    my_map = {}

    # fills a map with 1000000 random mappings of type (<int>: <int>)
    for i in range(0, 1000000):
        my_map[dist()] = dist()

    # prints out the 1000th smallest key and its value
    target_key = nth_smallest_key(1000, my_map)
    print("({}: {})".format(target_key, my_map[target_key]))

    # writes a new random mapping to the map
    # then prints out the 1000th smallest key and its value if that key
    # has changed
    # 100000 times
    for i in range(100000):
        my_map[dist()] = dist()

        test_key = nth_smallest_key(1000, my_map)
        if target_key != test_key:
            target_key = test_key
            print("({}: {})".format(target_key, my_map[target_key]))

        # print an indicator every 100 iterations for comparison
        if i % 100 == 0:
            print("iteration: {}".format(i))

def nth_smallest_key(n, m):
    return sorted(m.keys())[n]

if __name__ == "__main__":
    main()
#include <cstdio>
#include <map>
#include <random>
using namespace std;

int nth_smallest_key(int n, map<int, int>& m);

int main() {
    random_device rd;
    mt19937 psrng(rd());
    uniform_int_distribution<> dist(0, 2147483647);

    map<int, int> my_map;

    // fills a map with 1000000 random mappings of type (<int>: <int>)
    for (int i = 0; i < 1000000; i++) {
        my_map[dist(psrng)] = dist(psrng);
    }

    // prints out the 1000th smallest key and its value
    int target_key = nth_smallest_key(1000, my_map);
    printf("(%d: %d)\n", target_key, my_map[target_key]);

    // writes a new random mapping to the map
    // then prints out the 1000th smallest key and its value if that key
    // has changed
    // 100000 times
    for (int i = 0; i <= 100000; i++) {
        my_map[dist(psrng)] = dist(psrng);

        int test_key = nth_smallest_key(1000, my_map);
        if (target_key != test_key) {
            target_key = test_key;
            printf("(%d: %d)\n", target_key, my_map[target_key]);
        }
    }

    return 0;
}

int nth_smallest_key(int n, map<int, int>& m) {
    map<int, int>::iterator curr = m.begin();
    for (int i = 0; i < n; i++) {
        curr = next(curr);
    }
    return curr->first;
}

Makefile:

buildcpp:
        g++ -o main bst_semantics_cpp.cpp

runcpp: buildcpp
        ./main

runpython:
        python3 bst_semantics_python.py

In C++, this program runs on my machine in approximately 5 seconds

$ time ./main
(2211193: 2021141747)
(2208771: 1079444337)
(2208700: 1187119077)
(2208378: 1447743503)
...
(1996019: 1378401665)
(1989217: 1457497754)
(1989042: 1336229409)

real    0m4.915s
user    0m4.750s
sys     0m0.094s
$ 

However, in Python, the program has not reached 1% of completion after 120 seconds

$ time make runpython
python3 bst_semantics_python.py
(2158070: 1498305903)
iteration: 0
iteration: 100
iteration: 200
^CTraceback (most recent call last):
  File "bst_semantics_python.py", line 36, in <module>
    main()
  File "bst_semantics_python.py", line 23, in main
    test_key = nth_smallest_key(1000, my_map)
  File "bst_semantics_python.py", line 33, in nth_smallest_key
    return sorted(m.keys())[n]
KeyboardInterrupt
Makefile:8: recipe for target 'runpython' failed
make: *** [runpython] Error 1


real    2m2.040s
user    1m59.063s
sys     0m0.375s
$ 

To write this program efficiently, you need a data structure with binary search tree semantics.

Is there such a data structure in Python? Can you write this program efficiently with built-in Python data structures?

OregonTrail
  • 8,594
  • 7
  • 43
  • 58
  • My guess is that the C++ keys are sorted, so you always get the smallest one very quickly. Is that true? – Omer Tuchfeld Mar 14 '20 at 18:34
  • std::map is implemented as a binary search tree, which can be traversed in sorted order – OregonTrail Mar 14 '20 at 18:35
  • 1
    There are libraries that provide sorted dictionaries, e.g.: http://www.grantjenks.com/docs/sortedcontainers/ – UnholySheep Mar 14 '20 at 18:37
  • Part of the problem is the `smallest_key` is that you are sorting an entire array to then grab the smallest element, when you could just use `min(my_map)` – C.Nivs Mar 14 '20 at 18:37
  • The `sorted` call forces Python to go over the list of unsorted keys and sort them. A much faster implementation should be `min(my_map.keys())` – Omer Tuchfeld Mar 14 '20 at 18:37
  • `min` must iterate over all of the elements of `my_map` in Python, i.e. in time O(n). with std::map, `min` can be found in time O(logn). In other words, the optimization that you are suggestion is not an optimization. I will update the examples. – OregonTrail Mar 14 '20 at 18:39
  • I've updated the example with the proposed "improvement". Not having a built-in tree structure is a major downside of using Python. – OregonTrail Mar 14 '20 at 18:44
  • Note that calling `my_map.keys()` is also unnecessary, since dictionaries return keys when iterated over – C.Nivs Mar 14 '20 at 18:45
  • are you looking for a faster solution? – gold_cy Mar 14 '20 at 18:48
  • In Python, `min(my_map)` must iterate over all elements, in time O(n). In C++, `my_map.begin()` finds the smallest key in time O(logn). I've updated the question with all of the suggest "improvements", which are not improvements. – OregonTrail Mar 14 '20 at 18:48
  • @gold_cy I would be interested to know if there is a built-in data structure in Python with binary search tree semantics. I have also edited the question so that it is no longer "opinion based". – OregonTrail Mar 14 '20 at 18:49
  • have you considered using a heap queue to keep track of smallest item? not as fast as c++ but cuts python time to ~13 seconds. – gold_cy Mar 14 '20 at 18:55
  • Imagine that I would like to iterate over all elements in sorted order in each iteration of the loop. A heap is not an efficient data structure for this. You could also image that I want to find the value of 50000th smallest key. – OregonTrail Mar 14 '20 at 18:57
  • I suppose the real question is, why run `sorted` or `min` at all in your second loop? You are just checking if a new key is smaller than one you already track. Second, there's not really a place in this code where you iterate over your dictionary at all – C.Nivs Mar 14 '20 at 19:07
  • I have updated my question to more clearly illustrate the problem. – OregonTrail Mar 14 '20 at 19:36
  • Does this answer your question? [Why are there no sorted containers in Python's standard libraries?](https://stackoverflow.com/questions/5953205/why-are-there-no-sorted-containers-in-pythons-standard-libraries) – kaya3 Mar 14 '20 at 21:45
  • @kaya3 The answer to my question is implicit in that question. However, the justification for that answer is poor, OrderedDict is in the standard library, and is excessively confusing, because it is not a SortedDict. This question was closed as “opinion based”, however there is nothing “opinion based” about this question, unlike the answer to that question. – OregonTrail Mar 14 '20 at 22:18
  • @OregonTrail while I'm not sure opinion-based is the right flag here, "Seeking recommendations for libraries/resources/etc" could justifiably close this question – C.Nivs Mar 16 '20 at 13:42

2 Answers2

1

First, I wouldn't say this is an apples to apples comparison. I can't speak to whether or not your c++ is idiomatic, so I won't comment there. I can say that the python is not. The biggest issue I see here is that your code to re-define the smaller key can be done in a faster way:

# this is the second for loop, not the first
for i in range(100000):
    k, v = dist(), dist()
    my_map[k] = v

    # this avoids any new iteration through min or sorted
    min_key = min_key if min_key <= k else k

This avoids needless iteration over a data structure that you've already iterated over using min.

You also have an extra if check to print at the bottom of the python code that's not at the bottom of your c++ code.

I see this run in a non-scientific 4-ish seconds using python 3.6 with no other optimizations.

Why Avoid Sorted? Isn't it faster?

Yes, sorted is technically faster, by big-O notation, but that's not the whole story. With sorted you are also paying for creating a new data structure and filling it, which appears to cause any gains on paper to disappear in practice:

python -m timeit -s "from random import randint; x = [randint(0, 100000) for i in range(10000)]" "min(x)"
2000 loops, best of 5: 126 usec per loop

python -m timeit -s "from random import randint; x = [randint(0, 100000) for i in range(10000)]" "sorted(x)"
200 loops, best of 5: 953 usec per loop

You might point out that this test is somewhat random and repeated tests may yield different results, but best I can tell, min is consistently and significantly faster when finding the smallest element of a collection of data.

The actual question

Why doesn't python have a similar map to c++'s map? Probably because this is python and not c++. As snarky as that sounds, the people doing the good work on libraries and the like either have implemented it and I'm just not aware (plausible, so take this with a grain of salt), or it just hasn't been worth the time. If speed really matters that much, and you have a good handle of c++, then I'm not sure what moving to python would gain you.

Edit:

Reading through the comments, Unholy Sheep suggested the sortedcontainers library which appears to satisfy the requirements of a sorted dictionary (map)

C.Nivs
  • 12,353
  • 2
  • 19
  • 44
  • I have updated my question to more clearly illustrate the problem. The fact that Python does not have any built-in data structures with binary search tree semantics is not only a big drawback to using it in practice, but it means that "Python programmers" are not aware that they can perform these types of operations efficiently. My question is if there is anything like `sortedcontainers` in Python standard library. It seems like the answer is no. – OregonTrail Mar 14 '20 at 19:38
  • Python doesn't need to have *everything* in the standard library. It's a general purpose language, and so including everything would create significant bloat, since python is a bit bigger than c++. Besides, whatever isn't included, you can usually pip install, so I wouldn't call this a drawback. Are you sure python programmers aren't aware of how to write efficient BST code? Because I optimized your python *without* "binary search tree semantics" and it runs almost as quickly. Forgive my candor, but it seems like you've made up your mind regarding preferences of languages/libraries/etc – C.Nivs Mar 16 '20 at 13:34
  • It's like how even though python is quite popular in data science, it has no built in dataframe or multi dimensional array libraries. Why? There are great ones that already exist, are easy to pip install, and if they are not sufficient, the kind of programmer that would call `numpy` and `pandas` insufficient would be a good enough programmer that they could write their own. TL;DR: `pip install ` should not be viewed as a drawback, IMO – C.Nivs Mar 16 '20 at 13:50
  • I like Python as a language, which is why I'm disappointed that I can't use it in certain situations where installing libraries is not an option (I'm sure you can think of some situations where "using libraries" is not "allowed"). Also, your answer does not currently optimize the code in the question, which had been edited. – OregonTrail Mar 17 '20 at 20:50
  • Didn't see your edit, so point conceded there. There haven't been really any circumstances that *I've* encountered where a library wasn't allowed. However, my opinions are based on my experiences, and other folks have other experiences, therefore different opinions – C.Nivs Mar 17 '20 at 21:11
0

Python has no built-in data structures that allow this type of program to be expressed and implemented efficiently.

OregonTrail
  • 8,594
  • 7
  • 43
  • 58