7

I'm looking for the time complexity of these methods as a function of the number of rows in a dataframe, n.

Another way of asking this question is: Are indexes for dataframes in pandas btrees (with log(n) time look ups) or hash tables (with constant time lookups)?

Asking this question because I'd like a way to do constant time look ups for rows in a dataframe based on a custom index.

Peter Berg
  • 6,006
  • 8
  • 37
  • 51

2 Answers2

4

Alright so it would appear that:

1) You can build your own index on a dataframe with .set_index in O(n) time where n is the number of rows in the dataframe

2) The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index. So accessing a row for the first time using that index takes O(n) time

3) All subsequent row access takes constant time.

So it looks like the indexes are hash tables and not btrees.

Peter Berg
  • 6,006
  • 8
  • 37
  • 51
  • 1
    "The index is lazily initialized and built (in O(n) time) the first time you try to access a row using that index." Is it? Do you have a link to docs saying this? – juanpa.arrivillaga Nov 15 '19 at 18:35
  • @juanpa.arrivillaga Wes [here](https://www.slideshare.net/wesm/a-look-at-pandas-design-and-development) makes a few references to khash from klib which I believe they use for the indices. – ayhan Nov 15 '19 at 18:51
  • @juanpa.arrivillaga No. Hence the preface "It would appear that". If I were able to find documentation on this I wouldn't have asked the question :) – Peter Berg Nov 15 '19 at 20:55
  • I would also be curious about the specific behavior you are seeing that makes it appear that way. There isn't a lot of laziness in `pandas` so I'm just curious. And granted, you are asking a question but you are also posting an answer :) – juanpa.arrivillaga Nov 15 '19 at 20:58
  • 2
    Accessing the first row in a dataframe with ~10M rows using the custom index I built took ~3 seconds. Everything after that took milliseconds. Witnessed this behavior a few separate times. ¯\\_(ツ)_/¯ – Peter Berg Nov 15 '19 at 21:01
  • 1
    [Here's](https://stackoverflow.com/a/39055999/2285236) another related post. – ayhan Nov 15 '19 at 21:02
  • Interesting, any details you can add to your answer I think are valuable – juanpa.arrivillaga Nov 15 '19 at 21:10
  • @juanpa.arrivillaga, see my answer https://stackoverflow.com/a/72245331/11262633 to this question. – mherzog May 15 '22 at 03:28
1

From the Pandas Internals documentation, the default DataFrame index

Populates a dict of label to location in Cython to do O(1) lookups.

dict uses hash tables, supporting Peter Berg's answer to this question.

mherzog
  • 1,085
  • 1
  • 12
  • 24