19

Why is merging dataframes in Pandas on an index more efficient (faster) than on a column?

import pandas as pd

# Dataframes share the ID column
df = pd.DataFrame({'ID': [0, 1, 2, 3, 4],
                   'Job': ['teacher', 'scientist', 'manager', 'teacher', 'nurse']})

df2 = pd.DataFrame({'ID': [2, 3, 4, 5, 6, 7, 8],
                    'Level': [12, 15, 14, 20, 21, 11, 15], 
                    'Age': [33, 41, 42, 50, 45, 28, 32]})

enter image description here

df = df.set_index('ID')
df2 = df2.set_index('ID')

enter image description here

This represents about a 3.5 times speed up! (Using Pandas 0.23.0)

Reading through the Pandas internals page it says an Index "Populates a dict of label to location in Cython to do O(1) lookups." Does this mean that doing operations with an index is more efficient than with columns? Is it a best practice to always use the index for operations such as merges?

I read through the documentation for joining and merging and it doesn't explicitly mention any benefits to using the index.

willk
  • 3,727
  • 2
  • 27
  • 44
  • 1
    Related: [What is the performance impact of non-unique indexes in pandas?](https://stackoverflow.com/questions/16626058/what-is-the-performance-impact-of-non-unique-indexes-in-pandas) – jpp Jun 21 '18 at 14:22
  • 2
    @jpp is spot on to highlight the uniqueness as an issue. Recreate your example with non-unique indices and you will see that performance go away. Also, You are neglecting the time and effort to create the indices. And one last thing in `0.23` you can reference the names of the `index` levels in the `merge` so `df.merge(df2, on='ID')` works when `'ID'` is in the index or not. – piRSquared Jun 21 '18 at 14:28
  • @piRSquared The question linked to by jpp does not address the time difference between merging using columns and merging using indices. Specifically, why is there a significant time differences between the two calls? – willk Jun 21 '18 at 19:17
  • 2
    The dup target addresses what happens when doing look ups with an index that is unique and/or sorted. This is very much what happens when you place your column into the index. We can answer your question specific to your exact details. But is that adding any value above what the dup target already does? I decided that it didn't. If you are left still wondering what the answer is then maybe I was wrong and it isn't as obvious as I had thought. Give me a minute. – piRSquared Jun 21 '18 at 19:25
  • Ok, reopened. Someone can tie those elements together. I still think that most of the relevant information is embedded in that answer. – piRSquared Jun 21 '18 at 19:44
  • I appreciate the help. So is the answer that setting the index creates a sorted index allowing for more efficient merges? You are right that to compare the methods more fairly, I should take into account the time to create the index! – willk Jun 21 '18 at 21:13

1 Answers1

7

The reason for this is that the DataFrame's index is backed by a hash table.

To merge two sets, we need to find for each element of the first the corresponding in the second (if it exists) Searching is significantly faster if supported by a hash table because searching in an unsorted list is O(N), while in a list supported by a hash function ~O(1).

One strategy that could be faster to merge columns would be to first create a hash table for the smallest of the two. Still that means that the merge will be slower by the time it takes to create this dict.

ntg
  • 12,950
  • 7
  • 74
  • 95