0

I understand hash tables: In principle, you store a key's data in a fixed-size array, with the index/slot to use given by the key's hash. However, Python's dict class has the method dict.keys() which returns a list of the dict's keys. Where does this list come from? (Also, iterating over a dictionary implicitly iterates over its keys).


I tried to think about this myself and I identified the following requirements:

I thought maybe we could store the index of the next and previous non-empty slots for each slot, so we could jump to the next/prev element in O(1) and also clear a slot in O(1) (just update prev's next index and next's prev index). The problem is that insertion would be in O(log n) there because we would have to binary-search over the 'next' indexes.

Another explanation I considered is that maybe we just iterate over all the slots, and just ignore the empty slots and accept the runtime penalty from repeatedly checking empty slots every time we iterate over the keys. A disadvantage of this is that the hash table would need to be pretty full for this to be efficient, which would slow down insertions.


The related question "How are Python's Built In Dictionaries Implemented" never mentioned this aspect of dicts.

Edit: Source code for iteration seems to be here.

xjcl
  • 12,848
  • 6
  • 67
  • 89
  • Is iteration in O(n) a requirement of a dictionary? – Martin Beckett Jan 24 '19 at 16:53
  • It **is** discussed in the related question you linked. Ctrl+F for "probing". By the way all the things you listed as *O(1)* are really *O(1) average case*, and it's actually *O(n) Amortized Worst Case* (because of the probing). – wim Jan 24 '19 at 16:54
  • The things that go in the slots of a hashtable are *entries* that include both the key and its corresponding value. You HAVE to keep the key around, even if you have no need to support iteration of keys, so that you can recalculate the hashes when the table is resized. – jasonharper Jan 24 '19 at 16:55
  • @wim I thought "probing" is the strategy used to resolve hash collisions, and unrelated to my question? – xjcl Jan 24 '19 at 16:58
  • @MartinBeckett The site I linked said Python dicts have iteration in O(n) – xjcl Jan 24 '19 at 16:59
  • Re: probing. That's correct, but hash collisions are rare and there are many empty slots in the dictionary. So, usually you don't need to go probing, and then insertion/deleting can be O(1) - similar to indexing an array. – wim Jan 24 '19 at 18:31
  • @wim jasonharper Your answers do not correspond to what I wanted to know/ask, but I see why you got that impression from my question now. I think I communicated my question poorly. I edited it now so hopefully it's clearer. Basically I'm asking, **Where does dict.keys() source its list from?** – xjcl Jan 30 '19 at 17:32
  • `dict.keys()` does not return a list, it returns a view. And that view doesn't support *O(1)* indexing, as a proper list would. – wim Jan 30 '19 at 17:46
  • @wim Yes that's a good point! But once again not what I was asking. – xjcl Jan 30 '19 at 17:50
  • Possible dupes, or at least related questions: [Accessing dictionary items by position in Python 3.6+ efficiently](https://stackoverflow.com/q/52507860/674039) and [Why do dict_items objects not support indexing?](https://stackoverflow.com/q/52900328/674039) – wim Jan 30 '19 at 17:50
  • The question is really unclear, but it looks like you're just asking about how iteration of keys can be O(n). That's trivial! Even if there were 99 empty slots for every populated slot, it would still be O(n) just to iterate *all* slots. – wim Jan 30 '19 at 17:54

1 Answers1

0

I looked at the CPython source code here and while I'm still not 100% sure I think my second guess is correct: We just iterate over all the slots. We ignore empty slots and return all keys in non-empty slots.

Python dictionaries get resized once they are 2/3 full and (assuming no deletions) double in size after which they will be 2/6 = 1/3 full* which means we can get iteration in O(3n) (this is O(n) after dropping the constants).

*: The comment seems to be outdated, I think it would be even fuller with the current constants

xjcl
  • 12,848
  • 6
  • 67
  • 89