6

I was merging two data sets in pandas, and wanted to speed the process up, so I sorted both of them on the column that was being used for merge. (Previously those columns had been not at all ordered.) Sorting caused no discernible speed difference; both took about eight seconds.

If I was manually merging two stacks of paper based on, say, their page number, I would first sort each of them by page number. Otherwise I'd have to do a lot of flipping back and forth between the stacks.

I wrote a test to compare the two processes. It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.

The data generation process was so slow that I don't have time to try more rows -- but the merge nonetheless happens in zero perceptible time, even without sorting.

import pandas as pd
import numpy as np

def shuffle(df, n=1, axis=0):
  """ https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
  df = df.copy()
  for _ in range(n):
    df.apply(np.random.shuffle, axis=axis)
  return df

# Create some shuffled data
df1 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df2 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df1 = shuffle(df1)
df2 = shuffle(df2)

# Sort that data on column A
df1s = df1.sort_values(by='A')
df2s = df2.sort_values(by='A')

m  = df1. merge(df2,  on='A') # Merge the unsorted data
ms = df1s.merge(df2s, on='A') # Merge the sorted data

EDIT: I wrote a test with data that's 50 times wider and 1/5 as tall, and now the sorting seems to help?

import pandas as pd
import numpy as np

def shuffle(df, n=1, axis=0):
  """ https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
  df = df.copy()
  for _ in range(n):
    df.apply(np.random.shuffle, axis=axis)
  return df

# Create some shuffled data
nrows = 200000
reorderedIntegers  = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
reorderedIntegers2 = shuffle( pd.DataFrame({ 'A':range(nrows) }) )

# Widen it
extraColumns = pd.DataFrame( np.random.rand( nrows, 100 ) )
df  = pd.concat( [reorderedIntegers,  extraColumns], axis=1 )
df2 = pd.concat( [reorderedIntegers2, extraColumns], axis=1 )

# Create sorted copies
dfs  = df .sort_values(by='A')
dfs2 = df2.sort_values(by='A')

# Compare merge speeds
m  = df. merge(df2,  on='A') # Merge the unsorted data
ms = dfs.merge(df2s, on='A') # Merge the sorted data -- substantially faster now

P.S. I wanted to use timeit.timeit() to measure the speed of the two routines, but I kept getting an error like the following:

>>> timeit.timeit( "ms = df1s.merge(df2s, on='A')" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.6/timeit.py", line 233, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/opt/conda/lib/python3.6/timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
NameError: name 'df1s' is not defined
Jeffrey Benjamin Brown
  • 3,427
  • 2
  • 28
  • 40
  • 1
    as for your side note, i use jupyter notebook for timing. simply use the magic `%%timeit` – MattR Mar 20 '18 at 20:44
  • Part of it is probably that pandas is mostly implemented in C++ under the hood, so it's generally pretty darn fast. – seaotternerd Mar 20 '18 at 22:49
  • Where exactly are all those .cpp [files](https://github.com/pandas-dev/pandas/tree/master/pandas/_libs)? ;) @seaotternerd – Brad Solomon Mar 21 '18 at 01:04
  • Here: https://github.com/pandas-dev/pandas/tree/master/pandas/_libs/src. Also a bunch of it's mixed through the Python code with Cython. And some comes from the fact that Pandas makes heavy use of numpy, which also makes heavy use of cython. – seaotternerd Mar 21 '18 at 01:11
  • I'm aware @seaotternerd. My point is that it's utilizing C (.c, .h) and Cython (.pxd, .pyx), not C++. Although, there has been discussion about writing the low-level framework of a revamped Pandas in C++11. – Brad Solomon Mar 22 '18 at 17:01

1 Answers1

5

For starters, a pandas DataFrame is not implemented as a simple multidimensional array. In the code it describes the object as a:

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

Which is pretty complex and I don't expect anyone to wrap their head around that right away.

What this mentions is that it 'can be though of' as a dictionary like object. This mean that it may use some sort of hash map, implying a constant lookup time.

Merging a hash maps is not comparable to merging arrays (which is what merging 2 stacks of papers is doing) as the backend structure is completely different. Therefore sorting would not make a difference.

Unfortunately the connection between a DataFrame and a hash map is not perfect. Hash maps typically are unsorted and cannot have duplicate entries, neither of which match the DataFrame object implementation.

The other possibility, which seems more likely from looking into the code, is that since the merge opperation does not assume that the column is sorted, it goes ahead and sorts the columns itself before applying a more array-like merge. This means that pre-sorting will not make a difference as merging will re-sort the column anyways.

The pandas DataFrame object code can be found here and the merge opperation for the DataFrame's merge opperation can be found here.