
Beginning with two pandas dataframes of different shapes, what is the fastest way to select all rows in one dataframe that do not exist in the other (or to drop all rows in one dataframe that already exist in the other)? And are the fastest methods different for string-valued columns vs. numeric columns? The operation should be roughly equivalent to the code below.

import pandas as pd

string_df1 = pd.DataFrame({'latin':['a', 'b', 'c'],
                           'greek':['alpha', 'beta', 'gamma']})
string_df2 = pd.DataFrame({'latin':['z', 'c'],
                           'greek':['omega', 'gamma']})

numeric_df1 = pd.DataFrame({'A':[1, 2, 3],
                            'B':[1.01, 2.02, 3.03]})
numeric_df2 = pd.DataFrame({'A':[3, 9],
                            'B':[3.03, 9.09]})

def index_matching_rows(df1, df2, cols_to_match=None):
    '''
    Return the index of the subset of rows of df1 that are equal to
    at least one row in df2.
    '''
    if cols_to_match is None:
        cols_to_match = df1.columns

    # merge on the first column, then keep only the rows where all
    # remaining columns agree as well
    df1 = df1.reset_index()
    m = df1.merge(df2, on=cols_to_match[0], suffixes=('1', '2'))

    query = '&'.join(['{0}1 == {0}2'.format(c) for c in cols_to_match[1:]])

    m = m.query(query)

    return m['index']

print(string_df2.drop(index_matching_rows(string_df2, string_df1)))
print(numeric_df2.drop(index_matching_rows(numeric_df2, numeric_df1)))

Output:

  latin  greek
0     z  omega

   A     B
1  9  9.09

Some naive performance testing:

copies = 10
big_sdf1 = pd.concat([string_df1, string_df1]*copies)
big_sdf2 = pd.concat([string_df2, string_df2]*copies)
big_ndf1 = pd.concat([numeric_df1, numeric_df1]*copies)
big_ndf2 = pd.concat([numeric_df2, numeric_df2]*copies)
%%timeit
big_sdf2.drop(index_matching_rows(big_sdf2, big_sdf1))
# copies =  10:      2.61 ms   ±  27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies =  20:      4.44 ms   ±  43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies =  30:     18.4  ms   ± 132   µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies =  40:     74.6  ms   ± 453   µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 100: 19.2       s   ± 112   ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf2.drop(index_matching_rows(big_ndf2, big_ndf1))
# copies = 10:  2.56 ms ±    29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20:  4.38 ms ±    75.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.3  ms ±   194   µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 76.5  ms ± 1.76    ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[Figure: timing data points vs. number of copies, with an exponential fit y = 1.6·exp(0.094x)]

This code runs about as quickly for strings as for numeric data, and the growth appears exponential in the length of the dataframe (the curve above is 1.6*exp(0.094x), fit to the string data). I suspect the blow-up comes from merging on only the first column: with heavily duplicated keys the merge produces a near-cross-product intermediate frame, so the cost grows much faster than linearly. I'm working with dataframes that are on the order of 1e5 rows, so this is not a solution for me.


Here's the same performance check for Raymond Kwok's (accepted) answer below, in case anyone can beat it later. It's O(n).

%%timeit
big_sdf1_tuples = big_sdf1.apply(tuple, axis=1)
big_sdf2_tuples = big_sdf2.apply(tuple, axis=1)
big_sdf2_tuples.isin(big_sdf1_tuples)
# copies =  100:     4.82  ms ±     22   µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000:    44.6   ms ±    386   µs per loop (mean ± std. dev. of 7 runs,  10 loops each)
# copies =  1e4:   450     ms ±  9.44    ms per loop (mean ± std. dev. of 7 runs,   1 loop each)
# copies =  1e5: 4.42       s ± 27.6     ms per loop (mean ± std. dev. of 7 runs,   1 loop each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
big_ndf2_tuples = big_ndf2.apply(tuple, axis=1)
big_ndf2_tuples.isin(big_ndf1_tuples)
# copies =  100:     4.98 ms ±     28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000:    47    ms ±    288   µs per loop (mean ± std. dev. of 7 runs,  10 loops each)
# copies =  1e4:   461    ms ±  4.41    ms per loop (mean ± std. dev. of 7 runs,   1 loop each)
# copies =  1e5: 4.58      s ± 30.1     ms per loop (mean ± std. dev. of 7 runs,   1 loop each)

Indexing into the longest dataframe with

big_sdf2_tuples.loc[~big_sdf2_tuples.isin(big_sdf1_tuples)]

to recover the equivalent of the output in my code above adds about 10 ms.
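
For reference, pandas' usual idiom for this kind of anti-join is a left merge with indicator=True (essentially what ken suggests in the comments). A minimal sketch against the example frames above; the helper name anti_join_merge is mine:

def anti_join_merge(df1, df2, cols_to_match=None):
    '''Rows of df1 whose values (on cols_to_match) appear nowhere in df2.'''
    if cols_to_match is None:
        cols_to_match = list(df1.columns)
    # deduplicate df2 so the left merge can't multiply df1's rows,
    # then mark each row's origin with indicator=True
    dedup = df2[cols_to_match].drop_duplicates()
    m = df1.merge(dedup, on=cols_to_match, how='left', indicator=True)
    # m preserves df1's row order, so the boolean mask lines up
    return df1.loc[m['_merge'].eq('left_only').to_numpy()]

print(anti_join_merge(string_df2, string_df1))
#   latin  greek
# 0     z  omega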

– shortorian

  • Advisable to give input sample and expected output. Whereas we can read your code, we will have to generate sample dataframes in most cases. – wwnde Feb 26 '22 at 01:42
  • Based on your description, I guess `pandas.merge(df1, df2, on=cols_to_match)` is what you are looking for. However, your code doesn't even run for multiple reasons, so I can't determine if it meets your requirements. Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – ken Feb 26 '22 at 06:07
  • Thanks, sorry, I was moving quickly at the end of the day on Friday. Question updated. – shortorian Feb 28 '22 at 19:36

2 Answers


Beginning with two dataframes:

df1 = pd.DataFrame({'Runner': ['A', 'A', 'A', 'A'],
                    'Day':    ['1', '3', '8', '9'],
                    'Miles':  ['3', '4', '4', '2']})

df2 = df1.copy().drop([1, 3])

where the second has two fewer rows.

We can hash the rows:

df1_hashed = df1.apply(tuple, axis=1).apply(hash)
df2_hashed = df2.apply(tuple, axis=1).apply(hash)

and trust, as most people reasonably will, that two different rows are extremely unlikely to get the same hash value,

and get rows from df1 that do not exist in df2:

df1[~df1_hashed.isin(df2_hashed)]

  Runner Day Miles
1      A   3     4
3      A   9     2

As for the speed difference between strings and integers, I am sure you can test it with your real data.

Note 1: you may actually remove .apply(hash) from both lines.

Note 2: check out the answer to this question for more on isin and the use of hash.
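
Putting the pieces together on the question's example frames, here's a minimal end-to-end sketch of this approach (the helper name anti_join_tuples is mine, not from the answer):

def anti_join_tuples(df1, df2):
    # collapse each row to a hashable tuple, then keep the df1 rows
    # whose tuple never appears among df2's tuples
    df1_tuples = df1.apply(tuple, axis=1)
    df2_tuples = df2.apply(tuple, axis=1)
    return df1[~df1_tuples.isin(df2_tuples)]

print(anti_join_tuples(string_df2, string_df1))
#   latin  greek
# 0     z  omega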

– Raymond Kwok

  • Excellent, thanks. Any idea why this is slightly slower for the numeric test frames I've added to the question compared to the string-valued frames? – shortorian Feb 28 '22 at 20:33
  • Not sure about that. Is it true that (1) `big_sdf1` equals `big_ndf1.astype(str)` and (2) `big_sdf2` equals `big_ndf2.astype(str)`? – Raymond Kwok Mar 01 '22 at 02:06
  • If you mean equal elementwise, then yes, the following expressions are both `True`: `big_sdf1.eq(big_sdf1.astype(str)).all(axis=None)` `big_sdf2.eq(big_sdf2.astype(str)).all(axis=None)` But note that pandas converts `str` to their `object` type under the hood, so these are `True` also: `big_sdf1.eq(big_sdf1.astype('object')).all(axis=None)` `big_sdf2.eq(big_sdf2.astype('object')).all(axis=None)` – shortorian Mar 08 '22 at 21:05

pandas has a built-in hashing utility that's more than an order of magnitude faster than series of tuples:

%%timeit
big_sdf1_hashed = pd.util.hash_pandas_object(big_sdf1)
big_sdf2_hashed = pd.util.hash_pandas_object(big_sdf2)
big_sdf1.loc[~big_sdf1_hashed.isin(big_sdf2_hashed)]
# copies =  100:      1.05 ms ±      9.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000:      1.99 ms ±     28.6  µs per loop (mean ± std. dev. of 7 runs,  100 loops each)
# copies =  1e4:     10.5  ms ±     47.5  µs per loop (mean ± std. dev. of 7 runs,  100 loops each)
# copies =  1e5:    126    ms ±    747    µs per loop (mean ± std. dev. of 7 runs,   10 loops each)
# copies =  1e7: 14.1       s ± 78.2      ms per loop (mean ± std. dev. of 7 runs,    1 loop each)
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
big_ndf2_hashed = pd.util.hash_pandas_object(big_ndf2)
big_ndf1.loc[~big_ndf1_hashed.isin(big_ndf2_hashed)]
# copies =  100:    496 µs ±  12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000:    772 µs ±  12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies =  1e4:  3.88  ms ± 129   µs per loop (mean ± std. dev. of 7 runs,  100 loops each)
# copies =  1e5: 67.5   ms ± 775   µs per loop (mean ± std. dev. of 7 runs,   10 loops each)

Note that the difference in performance comes from creating the objects to be compared, rather than from searching through them. For copies = int(1e5):

%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
# 25 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
# 2.53 s ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The hashed series is also about three times smaller on disk than the tuple series (9 MB vs. 33 MB).
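
Wrapped up as a reusable helper (the function name is mine; note that pd.util.hash_pandas_object hashes the index along with the row values by default, so passing index=False makes the comparison depend only on each row's contents):

def anti_join_hashed(df1, df2):
    # hash each row's values (index=False ignores the row labels),
    # then keep the df1 rows whose hash never appears in df2
    df1_hashed = pd.util.hash_pandas_object(df1, index=False)
    df2_hashed = pd.util.hash_pandas_object(df2, index=False)
    return df1.loc[~df1_hashed.isin(df2_hashed)]

print(anti_join_hashed(numeric_df2, numeric_df1))
#    A     B
# 1  9  9.09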

– shortorian