The code you posted shows a clever way to produce a subtraction table. However, it doesn't play to Pandas' strengths. Pandas DataFrames store the underlying data in column-based blocks, so retrieval of the data is fastest when done by column, not by row. Since all the rows have the same index, the subtractions are performed by row (pairing each row with every other row), which means there is a lot of row-based data retrieval going on in df1-df2. That's not ideal for Pandas, particularly when not all the columns have the same dtype.
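To see the pairing on a small scale, here is a minimal sketch using two hypothetical Series, s1 and s2 (they are not from your code), whose rows all carry the same index label. It assumes Pandas aligns the repeated label here the same way it does in df1-df2:

```python
import pandas as pd

# Hypothetical toy data: every row carries the same index label, 0.
s1 = pd.Series([1, 2, 3], index=[0, 0, 0])
s2 = pd.Series([10, 20], index=[0, 0])

# Aligning on the repeated label pairs each row of s1 with each row of s2,
# so the result has len(s1) * len(s2) rows.
diff = s1 - s2
print(len(diff))
print(sorted(diff.tolist()))
```

The result grows as the product of the two lengths, which is why the alignment step dominates the running time for larger frames.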
Subtraction tables are something NumPy is good at:
In [5]: x = np.arange(10)
In [6]: y = np.arange(5)
In [7]: x[:, np.newaxis] - y
Out[7]:
array([[ 0, -1, -2, -3, -4],
       [ 1,  0, -1, -2, -3],
       [ 2,  1,  0, -1, -2],
       [ 3,  2,  1,  0, -1],
       [ 4,  3,  2,  1,  0],
       [ 5,  4,  3,  2,  1],
       [ 6,  5,  4,  3,  2],
       [ 7,  6,  5,  4,  3],
       [ 8,  7,  6,  5,  4],
       [ 9,  8,  7,  6,  5]])
You can think of x as one column of df1, and y as one column of df2. You'll see below that NumPy can handle all the columns of df1 and all the columns of df2 in basically the same way, using basically the same syntax.
The code below defines orig and using_numpy. orig is the code you posted; using_numpy is an alternative method which performs the subtraction using NumPy arrays:
In [2]: %timeit orig(df1.copy(), df2.copy())
10 loops, best of 3: 96.1 ms per loop
In [3]: %timeit using_numpy(df1.copy(), df2.copy())
10 loops, best of 3: 19.9 ms per loop
import numpy as np
import pandas as pd
N = 100
df1 = pd.DataFrame({'a': np.random.randn(10*N),
                    'b': [1, 2] * 5*N,
                    'c': np.random.randn(10*N)},
                   index=pd.date_range('1/1/2000', periods=10*N))
df2 = pd.DataFrame({'a': np.random.randn(N),
                    'b': [2, 1] * (N//2),
                    'c': np.random.randn(N)},
                   index=pd.date_range('1/1/2000', periods=N))
def orig(df1, df2):
    df1 = df1.reset_index() # 312 µs per loop
    df1['embarrassingHackInd'] = 0 # 75.2 µs per loop
    df1.set_index('embarrassingHackInd', inplace=True) # 526 µs per loop
    df1.rename(columns={'index':'origIndex'}, inplace=True) # 209 µs per loop
    df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9 # 23.1 µs per loop
    df1['df2Date'] = 0
    df2 = df2.reset_index()
    df2['embarrassingHackInd'] = 0
    df2.set_index('embarrassingHackInd', inplace=True)
    df2.rename(columns={'index':'origIndex'}, inplace=True)
    df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
    df2['df1Date'] = 0
    df3 = abs(df1-df2) # 88.7 ms per loop <-- this is the bottleneck
    return df3
def using_numpy(df1, df2):
    df1.index.name = 'origIndex'
    df2.index.name = 'origIndex'
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    df1_date = df1['origIndex']
    df2_date = df2['origIndex']
    df1['origIndex'] = df1_date.astype(np.int64)
    df2['origIndex'] = df2_date.astype(np.int64)
    arr1 = df1.values
    arr2 = df2.values
    arr3 = np.abs(arr1[:,np.newaxis,:]-arr2) # 3.32 ms per loop vs 88.7 ms
    arr3 = arr3.reshape(-1, 4)
    index = pd.MultiIndex.from_product(
        [df1_date, df2_date], names=['df1Date', 'df2Date'])
    result = pd.DataFrame(arr3, index=index, columns=df1.columns)
    # You could stop here, but the rest makes the result more similar to orig
    result.reset_index(inplace=True, drop=False)
    result['df1Date'] = result['df1Date'].astype(np.int64) // 10**9
    result['df2Date'] = result['df2Date'].astype(np.int64) // 10**9
    return result
def is_equal(expected, result):
    expected.reset_index(inplace=True, drop=True)
    result.reset_index(inplace=True, drop=True)
    # expected has dtypes 'O', while result has some float and int dtypes.
    # Make all the dtypes float for a quick and dirty comparison check
    expected = expected.astype('float')
    result = result.astype('float')
    columns = ['a','b','c','origIndex','df1Date','df2Date']
    return expected[columns].equals(result[columns])
expected = orig(df1.copy(), df2.copy())
result = using_numpy(df1.copy(), df2.copy())
assert is_equal(expected, result)
How x[:, np.newaxis] - y works:
This expression takes advantage of NumPy broadcasting.
To understand broadcasting, and NumPy in general, it pays to know the shapes of the arrays:
In [6]: x.shape
Out[6]: (10,)
In [7]: x[:, np.newaxis].shape
Out[7]: (10, 1)
In [8]: y.shape
Out[8]: (5,)
The [:, np.newaxis] adds a new axis to x on the right, so the shape is (10, 1). So x[:, np.newaxis] - y is the subtraction of an array of shape (5,) from an array of shape (10, 1).
On the face of it, that doesn't make sense, but NumPy arrays broadcast their shape according to certain rules to try to make their shapes compatible.
The first rule is that new axes can be added on the left. So an array of shape (5,) can broadcast itself to shape (1, 5).
The next rule is that an axis of length 1 can broadcast itself to arbitrary length. The values in the array are simply repeated as often as needed along that axis.
So when arrays of shape (10, 1) and (1, 5) are put together in a NumPy arithmetic operation, they are both broadcast up to arrays of shape (10, 5):
In [14]: broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)
In [15]: broadcasted_x
Out[15]:
array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8],
       [9, 9, 9, 9, 9]])
In [16]: broadcasted_y
Out[16]:
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])
So x[:, np.newaxis] - y is equivalent to broadcasted_x - broadcasted_y.
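If you'd like to check the two rules yourself, the following is a small sketch using the same x and y as above. The variable names bx, by, same and compatible are just illustrative:

```python
import numpy as np

x = np.arange(10)
y = np.arange(5)

# np.broadcast reports the common shape without doing any arithmetic.
common_shape = np.broadcast(x[:, np.newaxis], y).shape
print(common_shape)

# The implicit and the explicit versions produce the same table.
bx, by = np.broadcast_arrays(x[:, np.newaxis], y)
same = np.array_equal(x[:, np.newaxis] - y, bx - by)
print(same)

# Without the new axis the shapes (10,) and (5,) cannot be reconciled:
# the trailing axes differ and neither has length 1.
try:
    x - y
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```

The last check shows why the new axis is needed: broadcasting only stretches axes of length 1, it never reconciles two different lengths.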
Now, with this simpler example under our belt, we can look at arr1[:,np.newaxis,:]-arr2.
arr1 has shape (1000, 4) and arr2 has shape (100, 4). We want to subtract the items in the axis of length 4, for each row along the 1000-length axis, and each row along the 100-length axis. In other words, we want the subtraction to form an array of shape (1000, 100, 4).
Importantly, we don't want the 1000-axis to interact with the 100-axis. We want them to be in separate axes.
So if we add an axis to arr1 like this: arr1[:,np.newaxis,:], then its shape becomes
In [22]: arr1[:, np.newaxis, :].shape
Out[22]: (1000, 1, 4)
And now, NumPy broadcasting pumps up both arrays to the common shape of (1000, 100, 4). Voila, a subtraction table.
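The same thing is easier to inspect at a smaller scale. Here is a sketch with two hypothetical stand-ins for arr1 and arr2 (small1 with 3 rows, small2 with 2 rows, 2 columns each), checked against explicit loops:

```python
import numpy as np

# Hypothetical stand-ins for arr1 and arr2: 3 rows and 2 rows, 2 columns each.
small1 = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
small2 = np.array([[0.5, 5.0], [1.5, 15.0]])

# Shapes (3, 1, 2) and (2, 2) broadcast to (3, 2, 2).
table = np.abs(small1[:, np.newaxis, :] - small2)
print(table.shape)

# table[i, j] holds the column-wise difference between row i of small1
# and row j of small2; compare against the explicit double loop.
for i in range(len(small1)):
    for j in range(len(small2)):
        assert np.array_equal(table[i, j], np.abs(small1[i] - small2[j]))
```

The first axis indexes rows of small1 and the second indexes rows of small2, so the two row axes never mix.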
To massage the values into a 2D DataFrame of shape (1000*100, 4), we can use reshape:
arr3 = arr3.reshape(-1, 4)
The -1 tells NumPy to replace -1 with whatever positive integer is needed for the reshape to make sense. Since arr3 has 1000*100*4 values, the -1 is replaced with 1000*100. Using -1 is nicer than writing 1000*100, however, since it allows the code to work even if we change the number of rows in df1 and df2.
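One detail worth checking is that the flattened rows line up with the MultiIndex built by from_product. Below is a small sketch with a hypothetical 3 x 2 x 4 table and made-up labels ('r0', 's0', and so on), assuming NumPy's default C-order reshape:

```python
import numpy as np
import pandas as pd

# Hypothetical small table: 3 "df1 rows" by 2 "df2 rows" by 4 columns.
table = np.arange(3 * 2 * 4).reshape(3, 2, 4)

# The -1 is inferred as 3 * 2 = 6.
flat = table.reshape(-1, 4)
print(flat.shape)

# With the default C order, flat row k comes from table[i, j]
# where i, j = divmod(k, 2).
for k in range(flat.shape[0]):
    i, j = divmod(k, 2)
    assert np.array_equal(flat[k], table[i, j])

# MultiIndex.from_product pairs the labels in that same order,
# so each flattened row receives the matching (df1, df2) label pair.
index = pd.MultiIndex.from_product([['r0', 'r1', 'r2'], ['s0', 's1']])
frame = pd.DataFrame(flat, index=index)
row = frame.loc[('r1', 's0')].tolist()
print(row)
```

This is why using_numpy can pass arr3 and the product index straight to pd.DataFrame without any reordering.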