I was merging two data sets in pandas, and wanted to speed the process up, so I sorted both of them on the column that was being used for merge. (Previously those columns had been not at all ordered.) Sorting caused no discernible speed difference; both took about eight seconds.
If I was manually merging two stacks of paper based on, say, their page number, I would first sort each of them by page number. Otherwise I'd have to do a lot of flipping back and forth between the stacks.
I wrote a test to compare the two processes. It generates two frames with a million rows each, in random order. Then it generates two more that have been sorted on the first column. Then it merges the first two, and last, merges the second two.
The data generation process was so slow that I don't have time to try more rows -- but the merge nonetheless happens in zero perceptible time, even without sorting.
import pandas as pd
import numpy as np
def shuffle(df, n=1, axis=0):
""" https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
df = df.copy()
for _ in range(n):
df.apply(np.random.shuffle, axis=axis)
return df
# Create some shuffled data
df1 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df2 = pd.DataFrame({'A':range(1000000), 'B':range(1000000)})
df1 = shuffle(df1)
df2 = shuffle(df2)
# Sort that data on column A
df1s = df1.sort_values(by='A')
df2s = df2.sort_values(by='A')
m = df1. merge(df2, on='A') # Merge the unsorted data
ms = df1s.merge(df2s, on='A') # Merge the sorted data
EDIT: I wrote a test with data that's 50 times wider and 1/5 as tall, and now the sorting seems to help?
import pandas as pd
import numpy as np
def shuffle(df, n=1, axis=0):
""" https://stackoverflow.com/questions/15772009/shuffling-permutating-a-dataframe-in-pandas """
df = df.copy()
for _ in range(n):
df.apply(np.random.shuffle, axis=axis)
return df
# Create some shuffled data
nrows = 200000
reorderedIntegers = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
reorderedIntegers2 = shuffle( pd.DataFrame({ 'A':range(nrows) }) )
# Widen it
extraColumns = pd.DataFrame( np.random.rand( nrows, 100 ) )
df = pd.concat( [reorderedIntegers, extraColumns], axis=1 )
df2 = pd.concat( [reorderedIntegers2, extraColumns], axis=1 )
# Create sorted copies
dfs = df .sort_values(by='A')
dfs2 = df2.sort_values(by='A')
# Compare merge speeds
m = df. merge(df2, on='A') # Merge the unsorted data
ms = dfs.merge(df2s, on='A') # Merge the sorted data -- substantially faster now
P.S. I wanted to use timeit.timeit() to measure the speed of the two routines, but I kept getting an error like the following:
>>> timeit.timeit( "ms = df1s.merge(df2s, on='A')" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.6/timeit.py", line 233, in timeit
return Timer(stmt, setup, timer, globals).timeit(number)
File "/opt/conda/lib/python3.6/timeit.py", line 178, in timeit
timing = self.inner(it, self.timer)
File "<timeit-src>", line 6, in inner
NameError: name 'df1s' is not defined