Why is data selection performance "much better" on lexicographically sorted dataframes?

Question

I'm working my way through Wes McKinney's new edition of Python for Data Analysis and on pg. 228 in Chapter 8 he notes that data selection performance in pandas is "much better" on hierarchically indexed objects (e.g., dataframes) if the index is lexicographically sorted starting with the outermost level.

In other words, data selection on this dataframe:

key1 key2 col1
1    a    11
     b    12
2    a    13
     b    14

...is "much better" than data selection on this dataframe:

key1 key2 col1
1    a    11
2    a    13
1    b    12
2    b    14
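
For anyone who wants to reproduce the two frames, here is a minimal sketch. The variable names (`df_sorted`, `df_unsorted`, `resorted`) are my own and not from the book, and I am assuming a reasonably recent pandas where `MultiIndex.is_monotonic_increasing` is available as a way to check whether the index is sorted.

```python
import pandas as pd

# First frame: index is lexicographically sorted, outermost level first.
sorted_idx = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (2, "a"), (2, "b")], names=["key1", "key2"]
)
df_sorted = pd.DataFrame({"col1": [11, 12, 13, 14]}, index=sorted_idx)

# Second frame: same rows, but key1 alternates, so the index is not sorted.
unsorted_idx = pd.MultiIndex.from_tuples(
    [(1, "a"), (2, "a"), (1, "b"), (2, "b")], names=["key1", "key2"]
)
df_unsorted = pd.DataFrame({"col1": [11, 13, 12, 14]}, index=unsorted_idx)

# Report whether each index is sorted across all levels.
print(df_sorted.index.is_monotonic_increasing)
print(df_unsorted.index.is_monotonic_increasing)

# Sorting the second frame by its index should reproduce the first one.
resorted = df_unsorted.sort_index()
print(resorted.equals(df_sorted))
```

As far as I can tell, `sort_index()` is all it takes to get from the second layout to the first, so my question is really about what pandas can do differently once the index is in that order.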

Wes doesn't provide an explanation for this statement.

Please, would anyone explain to me:

1. Why is data selection on the first dataframe "much better" than on the second dataframe? In other words, why is data selection on dataframes with a hierarchical index "much better" when the dataframe is lexicographically sorted starting on the outermost level?
2. What does "much better" mean in this context? Faster? More memory efficient? Something else?
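
To test the "faster" reading empirically, something like the following could be used. This is only a rough sketch under my own assumptions: the data size, the random seed, the key value `500`, and the helper `select` are all hypothetical choices of mine, and the timings will vary by machine and pandas version, so I am not quoting any numbers.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 1,000 x 100 key pairs, built in sorted order.
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [range(1000), range(100)], names=["key1", "key2"]
)
df_sorted = pd.DataFrame({"col1": np.arange(len(idx))}, index=idx)

# Same rows in a random order, so the index is no longer lexsorted.
df_shuffled = df_sorted.iloc[rng.permutation(len(idx))]


def select(df):
    # Select every row whose outermost key (key1) equals 500.
    return df.loc[500]


# Time the same selection against both layouts.
t_sorted = timeit.timeit(lambda: select(df_sorted), number=200)
t_shuffled = timeit.timeit(lambda: select(df_shuffled), number=200)
print(f"sorted: {t_sorted:.4f}s  shuffled: {t_shuffled:.4f}s")
```

This only measures wall-clock time for one kind of lookup on the outermost level. It says nothing about memory use, which is part of why I am asking what "much better" is supposed to cover.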

Related: [What is the performance impact of non-unique indexes in pandas?](https://stackoverflow.com/questions/16626058/what-is-the-performance-impact-of-non-unique-indexes-in-pandas); [What is the point of indexing in pandas?](https://stackoverflow.com/questions/27238066/what-is-the-point-of-indexing-in-pandas). I think a good answer to this question will point to some of the pandas source code which (to me) is a minefield. — jpp, Feb 27 '18 at 16:28

0 Answers