Difference between every row and column in two DataFrames (Python / Pandas)

Question

Is there a more efficient way to compare every column in every row in one DF to every column in every row of another DF? This feels sloppy to me, but my loop / apply attempts have been much slower.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': np.random.randn(1000),
                   'b': [1, 2] * 500,
                   'c': np.random.randn(1000)},
                   index=pd.date_range('1/1/2000', periods=1000))
df2 = pd.DataFrame({'a': np.random.randn(100),
                'b': [2, 1] * 50,
                'c': np.random.randn(100)},
               index=pd.date_range('1/1/2000', periods=100))
df1 = df1.reset_index()
df1['embarrassingHackInd'] = 0
df1.set_index('embarrassingHackInd', inplace=True)
df1.rename(columns={'index':'origIndex'}, inplace=True)
df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9
df1['df2Date'] = 0
df2 = df2.reset_index()
df2['embarrassingHackInd'] = 0
df2.set_index('embarrassingHackInd', inplace=True)
df2.rename(columns={'index':'origIndex'}, inplace=True)
df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
df2['df1Date'] = 0
timeit df3 = abs(df1-df2)

10 loops, best of 3: 60.6 ms per loop

I need to know which comparison was made, thus the ugly addition of each opposing index to the comparison DF so that it will end up in the final DF.

Thanks in advance for any assistance.

I forgot to mention, my actual DF's have millions of rows and dozens of columns to compare. With that size, the apply attempts were taking hours. — howMuchCheeseIsTooMuchCheese, Aug 31 '14 at 21:37
See related: http://stackoverflow.com/questions/17095101/outputting-difference-in-two-pandas-dataframes-side-by-side-highlighting-the-d — EdChum, Aug 31 '14 at 22:07
@EdChum yeah, I saw that one, that is determining what changed between two DF's, not the difference in the values. — howMuchCheeseIsTooMuchCheese, Sep 10 '14 at 16:24

unutbu · Accepted Answer · 2014-09-01T13:52:45.400

The code you posted shows a clever way to produce a subtraction table. However, it doesn't play to Pandas' strengths. Pandas DataFrames store the underlying data in column-based blocks, so data retrieval is fastest when done by column, not by row. Because all the rows share the same index value, df1-df2 pairs each row of df1 with every row of df2, which means there is a lot of row-based data retrieval going on. That's not ideal for Pandas, particularly when the columns don't all have the same dtype.

Subtraction tables are something NumPy is good at:

In [5]: x = np.arange(10)

In [6]: y = np.arange(5)

In [7]: x[:, np.newaxis] - y
Out[7]: 
array([[ 0, -1, -2, -3, -4],
       [ 1,  0, -1, -2, -3],
       [ 2,  1,  0, -1, -2],
       [ 3,  2,  1,  0, -1],
       [ 4,  3,  2,  1,  0],
       [ 5,  4,  3,  2,  1],
       [ 6,  5,  4,  3,  2],
       [ 7,  6,  5,  4,  3],
       [ 8,  7,  6,  5,  4],
       [ 9,  8,  7,  6,  5]])

You can think of x as one column of df1, and y as one column of df2. You'll see below that NumPy can handle all the columns of df1 and all the columns of df2 at once, using essentially the same syntax.

The code below defines orig and using_numpy. orig is the code you posted; using_numpy is an alternative that performs the subtraction using NumPy arrays. Here are the timings:

In [2]: %timeit orig(df1.copy(), df2.copy())
10 loops, best of 3: 96.1 ms per loop

In [3]: %timeit using_numpy(df1.copy(), df2.copy())
10 loops, best of 3: 19.9 ms per loop

import numpy as np
import pandas as pd
N = 100
df1 = pd.DataFrame({'a': np.random.randn(10*N),
                   'b': [1, 2] * (5*N),
                   'c': np.random.randn(10*N)},
                   index=pd.date_range('1/1/2000', periods=10*N))
df2 = pd.DataFrame({'a': np.random.randn(N),
                'b': [2, 1] * (N//2),
                'c': np.random.randn(N)},
               index=pd.date_range('1/1/2000', periods=N))

def orig(df1, df2):
    df1 = df1.reset_index() # 312 µs per loop
    df1['embarrassingHackInd'] = 0 # 75.2 µs per loop
    df1.set_index('embarrassingHackInd', inplace=True) # 526 µs per loop
    df1.rename(columns={'index':'origIndex'}, inplace=True) # 209 µs per loop
    df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9 # 23.1 µs per loop
    df1['df2Date'] = 0

    df2 = df2.reset_index()
    df2['embarrassingHackInd'] = 0
    df2.set_index('embarrassingHackInd', inplace=True)
    df2.rename(columns={'index':'origIndex'}, inplace=True)
    df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
    df2['df1Date'] = 0
    df3 = abs(df1-df2) # 88.7 ms per loop  <-- this is the bottleneck
    return df3

def using_numpy(df1, df2):
    df1.index.name = 'origIndex'
    df2.index.name = 'origIndex'
    df1.reset_index(inplace=True) 
    df2.reset_index(inplace=True) 
    df1_date = df1['origIndex']
    df2_date = df2['origIndex']
    df1['origIndex'] = df1_date.astype(np.int64) 
    df2['origIndex'] = df2_date.astype(np.int64) 

    arr1 = df1.values
    arr2 = df2.values
    arr3 = np.abs(arr1[:,np.newaxis,:]-arr2) # 3.32 ms per loop vs 88.7 ms 
    arr3 = arr3.reshape(-1, 4)
    index = pd.MultiIndex.from_product(
        [df1_date, df2_date], names=['df1Date', 'df2Date'])
    result = pd.DataFrame(arr3, index=index, columns=df1.columns)
    # You could stop here, but the rest makes the result more similar to orig
    result.reset_index(inplace=True, drop=False)
    result['df1Date'] = result['df1Date'].astype(np.int64) // 10**9
    result['df2Date'] = result['df2Date'].astype(np.int64) // 10**9
    return result

def is_equal(expected, result):
    expected.reset_index(inplace=True, drop=True)
    result.reset_index(inplace=True, drop=True)

    # expected has dtypes 'O', while result has some float and int dtypes. 
    # Make all the dtypes float for a quick and dirty comparison check
    expected = expected.astype('float')
    result = result.astype('float')
    columns = ['a','b','c','origIndex','df1Date','df2Date']
    return expected[columns].equals(result[columns])

expected = orig(df1.copy(), df2.copy())
result = using_numpy(df1.copy(), df2.copy())
assert is_equal(expected, result)
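
The question's comments mention DataFrames with millions of rows. The full subtraction table has len(df1) * len(df2) rows, so at that size it may not fit in memory in one piece. The following is only a minimal sketch, assuming the result can be consumed one block at a time; the function name iter_abs_diff_blocks and the chunk_size parameter are hypothetical and not part of the code above:

```python
import numpy as np

def iter_abs_diff_blocks(arr1, arr2, chunk_size=1000):
    """Yield (start, block) pairs, where block holds the absolute
    differences between arr1[start:start + chunk_size] and every row of arr2.
    (Hypothetical helper for illustration only.)"""
    for start in range(0, len(arr1), chunk_size):
        chunk = arr1[start:start + chunk_size]
        # Same broadcasting trick as using_numpy, applied to one chunk
        yield start, np.abs(chunk[:, np.newaxis, :] - arr2)

# Small stand-ins for df1.values and df2.values
arr1 = np.arange(20, dtype=float).reshape(10, 2)
arr2 = np.arange(6, dtype=float).reshape(3, 2)

blocks = [block for _, block in iter_abs_diff_blocks(arr1, arr2, chunk_size=4)]
full = np.abs(arr1[:, np.newaxis, :] - arr2)
```

Each block has shape (rows_in_chunk, len(arr2), n_columns), and concatenating the blocks reproduces the full table, so the chunk size only trades memory for the number of iterations.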

How x[:, np.newaxis] - y works:

This expression takes advantage of NumPy broadcasting. To understand broadcasting, and NumPy in general, it pays to know the shapes of the arrays:

In [6]: x.shape
Out[6]: (10,)

In [7]: x[:, np.newaxis].shape
Out[7]: (10, 1)

In [8]: y.shape
Out[8]: (5,)

The [:, np.newaxis] adds a new axis to x on the right, so the shape is (10, 1). So x[:, np.newaxis] - y subtracts an array of shape (5,) from an array of shape (10, 1).

On the face of it, that doesn't make sense, but NumPy arrays broadcast their shape according to certain rules to try to make their shapes compatible.

The first rule is that new axes can be added on the left. So an array of shape (5,) can broadcast itself to shape (1, 5).

The next rule is that an axis of length 1 can be broadcast to arbitrary length. The values in the array are simply repeated as often as needed along that axis.

So when arrays of shape (10, 1) and (1, 5) are put together in a NumPy arithmetic operation, they are both broadcast up to arrays of shape (10, 5):

In [14]: broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)

In [15]: broadcasted_x
Out[15]: 
array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8],
       [9, 9, 9, 9, 9]])

In [16]: broadcasted_y
Out[16]: 
array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

So x[:, np.newaxis] - y is equivalent to broadcasted_x - broadcasted_y.
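
As a quick check of the two rules, here is a small sketch (the variable names common_shape and incompatible are just for illustration). It asks NumPy for the broadcast shape and shows that, without the new axis, shapes (10,) and (5,) cannot be combined:

```python
import numpy as np

x = np.arange(10)
y = np.arange(5)

# Shape NumPy arrives at when combining (10, 1) with (5,)
common_shape = np.broadcast(x[:, np.newaxis], y).shape

# Without the new axis, (10,) and (5,) are not compatible:
# neither trailing axis has length 1, and the lengths differ.
try:
    x - y
    incompatible = False
except ValueError:
    incompatible = True

table = x[:, np.newaxis] - y
```

Here common_shape is (10, 5) and incompatible is True, which is why the new axis is needed to keep the two lengths on separate axes.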

Now, with this simpler example under our belt, we can look at arr1[:,np.newaxis,:]-arr2.

arr1 has shape (1000, 4) and arr2 has shape (100, 4). We want to subtract the items in the axis of length 4, for each row along the 1000-length axis, and each row along the 100-length axis. In other words, we want the subtraction to form an array of shape (1000, 100, 4).

Importantly, we don't want the 1000-axis to interact with the 100-axis. We want them to be in separate axes.

So if we add an axis to arr1 like this: arr1[:,np.newaxis,:], then its shape becomes

In [22]: arr1[:, np.newaxis, :].shape
Out[22]: (1000, 1, 4)

And now NumPy broadcasting expands both arrays to the common shape (1000, 100, 4): arr2's shape (100, 4) is first treated as (1, 100, 4), and then the length-1 axes are stretched. Voilà, a subtraction table.
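
To convince yourself that the broadcast expression pairs every row of one array with every row of the other, here is a small sketch that compares it against explicit loops. The arrays are random stand-ins for df1.values and df2.values, shrunk to 6 and 3 rows:

```python
import numpy as np

rng = np.random.default_rng(0)
arr1 = rng.standard_normal((6, 4))   # stand-in for df1.values
arr2 = rng.standard_normal((3, 4))   # stand-in for df2.values

# Broadcast version: (6, 1, 4) against (3, 4) gives (6, 3, 4)
table = np.abs(arr1[:, np.newaxis, :] - arr2)

# Reference version: one row pair at a time
expected = np.empty((6, 3, 4))
for i in range(6):
    for j in range(3):
        expected[i, j] = np.abs(arr1[i] - arr2[j])

matches = np.allclose(table, expected)
```

table[i, j] holds the column-wise absolute difference between row i of arr1 and row j of arr2.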

To massage the values into a 2D array of shape (1000*100, 4), suitable for building a DataFrame, we can use reshape:

arr3 = arr3.reshape(-1, 4)

The -1 tells NumPy to replace -1 with whatever positive integer is needed for the reshape to make sense. Since arr3 has 1000*100*4 values, the -1 is replaced with 1000*100. Using -1 is nicer than writing 1000*100, however, since it allows the code to work even if we change the number of rows in df1 and df2.
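
The reshape keeps the rows in the same order that pd.MultiIndex.from_product generates its labels (first level varies slowest), which is what lets using_numpy attach the index directly. A small sketch with made-up labels ('r0', 's0', and so on are purely illustrative):

```python
import numpy as np
import pandas as pd

arr1 = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]])  # 3 rows
arr2 = np.array([[1.0, 0.5], [2.0, 1.5]])                  # 2 rows

# (3, 1, 2) - (2, 2) -> (3, 2, 2), flattened to (6, 2)
flat = (arr1[:, np.newaxis, :] - arr2).reshape(-1, arr1.shape[1])

index = pd.MultiIndex.from_product(
    [['r0', 'r1', 'r2'], ['s0', 's1']], names=['left', 'right'])
result = pd.DataFrame(flat, index=index, columns=['a', 'b'])

# The row labelled ('r1', 's0') should hold arr1[1] - arr2[0]
row = result.loc[('r1', 's0')].to_numpy()
```

Here row equals arr1[1] - arr2[0], confirming that each label lands on the comparison it describes.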

Can you explain how `x[:, np.newaxis]` works? I understand that x[:] just slices whole table, but I don't understand what (and how) happens with `np.newaxis`. And also what is that syntax? Is it numpy specific and can it be used in different ways? — Pawel Wisniewski, Sep 01 '14 at 11:33
I've added an explanation of how `x[:, np.newaxis] - y` works. — unutbu, Sep 01 '14 at 12:29
