31

I understand dictionaries are insertion ordered in Python 3.6+, as an implementation detail in 3.6 and official in 3.7+.

Given they are ordered, it seems strange that no methods exist to retrieve the ith item of a dictionary by insertion order. The only solutions available appear to have O(n) complexity, either:

  1. Convert to a list via an O(n) process and then use list.__getitem__.
  2. enumerate dictionary items in a loop and return the value when the desired index is reached. Again, with O(n) time complexity.

Since getting an item from a list has O(1) complexity, is there a way to achieve the same complexity with dictionaries? Either with regular dict or collections.OrderedDict would work.

If it's not possible, is there a structural reason preventing such a method, or is this just a feature which has not yet been considered / implemented?

jpp
  • 159,742
  • 34
  • 281
  • 339
  • It's implemented as a linked list. Otherwise, it would be impossible to delete elements. – o11c Sep 25 '18 at 23:25
  • I can think of an obscure reason. It makes JSON-lines more stable without an enclosing list and individual dictionaries. Beyond that very minor detail, I've not really understood the hype – roganjosh Sep 25 '18 at 23:26
  • @o11c, OK, so there's clearly a gap in my understanding. But I can see (maybe) what you mean, perhaps you need to have O(*n*) position access to keep O(1) deletion vs O(*n*) for lists. – jpp Sep 25 '18 at 23:32
  • 1
    @o11c According to https://stackoverflow.com/a/39980744/987358 there is just an array `dk_entries` of entries in order of insertion. No linked list. Deleted entries are replaced by a dummy and when adding new items, a resize of the array (with removing dummies) can happen. – Michael Butscher Sep 25 '18 at 23:32
  • 2
    @o11c it is *not* implemented as a linked-list – juanpa.arrivillaga Sep 25 '18 at 23:34
  • 7
    I think they just don't want to add sequence-like behavior to what is fundamentally a mapping. IOW: you shouldn't be using your dict like a list, but it *does* maintain order. – juanpa.arrivillaga Sep 25 '18 at 23:35

2 Answers2

39

For an OrderedDict it's inherently O(n) because the ordering is recorded in a linked list.

For the builtin dict, there's a vector (a contiguous array) rather than a linked list, but pretty much the same thing in the end: the vector contains a few kind of "dummies", special internal values that mean "no key has been stored here yet" or "a key used to be stored here but no longer". That makes, e.g., deleting a key extremely cheap (just overwrite the key with a dummy value).

But without adding auxiliary data structures on top of that, there's no way to skip over the dummies without marching over them one at a time. Because Python uses a form of open addressing for collision resolution, and keeps the load factor under 2/3, at least a third of the vector's entries are dummies. the_vector[i] can be accessed in O(1) time, but really has no predictable relation to the i'th non-dummy entry.

Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • From my understanding of the >3.6 implementation, there are *two* vectors, the sparse-index array is where the open-addressing occurs, but that the actual entries vector is simply that, an array of the entries in order, with no dummies, no? – juanpa.arrivillaga Sep 25 '18 at 23:39
  • 6
    @juanpa.arrivillaga, it's more complicated - what isn't? ;-) There are "split" and "non-split" dicts under the covers, etc. For a regular old dict ("non-split"), deleting a key _also_ sets the corresponding value slot to NULL, so same thing; you have to skip over the NULL values one at a time. See the loop in dictobject.c's `dictiter_iternextkey()`. Iterating over "the keys" _actually_ iterates over the values, which are in insertion order, but can contain NULLs in arbitrary places. Once a none-NULL value is found, it contains a pointer back to the key. – Tim Peters Sep 25 '18 at 23:51
  • Ah, I see. Just to make sure I am understanding you correctly, when you delete a key, it's actually set to `null` in the entries vector. This is unlike the POC implementation [here](http://code.activestate.com/recipes/578375/), where the value is simply popped from the vector (the list `self.entries` in `__delitem__`)? I assume the motivation is not to incur an O(N) penalty for deletions? – juanpa.arrivillaga Sep 26 '18 at 17:13
  • 2
    The POC you linked to was for something else entirely: a more space-efficient dict implementation. It doesn't preserve insertion order at all. Indeed, its "swap with the lastmost entry to avoid leaving a 'hole'" can move the last entry to any position whatsoever. The current implementation is both space-efficient and order-preserving, but leaving no holes on deletion while preserving order would require physically moving every entry following the one being deleted. Instead it just overwrites the deleted value with NULL (leaving "a hole"). – Tim Peters Sep 26 '18 at 18:23
3

As per @TimPeters' answer, there are structural reasons why you cannot access dictionary items by position in O(1) time.

It's worth considering the alternatives if you are looking for O(1) lookup by key or position. There are 3rd party libraries such as NumPy / Pandas which offer such functionality, efficient especially for numeric arrays where pointers are not required.

With Pandas, you can construct a "dictionary-like" series with unique labels offering O(1) lookup by "label" or position. What you sacrifice is performance when deleting a label, which incurs O(n) cost, much like list.

import pandas as pd

s = pd.Series(list(range(n)))

# O(n) item deletion
del s[i]
s.drop(i)
s.pop(i)

# O(1) lookup by label
s.loc[i]
s.at[i]
s.get(i)
s[i]

# O(1) lookup by position
s.iloc[i]
s.iat[i]

pd.Series is by no means a drop-in replacement for dict. For example, duplicate keys are not prevented and will cause issues if the series is used primarily as a mapping. However, where data is stored in a contiguous memory block, as in the example above, you may see significant performance improvements.

See also:

  1. What are the advantages of NumPy over regular Python lists?.
  2. What is the performance impact of non-unique indexes in pandas?
  3. Pandas DataFrame search is linear time or constant time?
jpp
  • 159,742
  • 34
  • 281
  • 339
  • 1
    Nice. I was thinking what would the easiest data structure be which fulfils OP's requirements. – Eric Duminil Sep 26 '18 at 11:25
  • 1
    @EricDuminil, Yes, indeed. One doesn't always think "Pandas series!" when considering alternatives to `dict`, but if certain criteria are met then it's certainly viable. The syntax is also often comparable, e.g. `s[i]`, `s.get(i)`, `del s[i]`, `s.keys()`, `s.items()`. – jpp Sep 26 '18 at 11:29