6

So i essentially want to implement the equivalent of R's match() function in Python, using Pandas dataframes - without using a for-loop.

In R match() returns a vector of the positions of (first) matches of its first argument in its second.

Let's say that I have two df A and B, of which both include the column C. Where

A$C = c('a','b')
B$C = c('c','c','b','b','c','b','a','a')

In R we would get

match(A$C,B$C) = c(7,3)

What is an equivalent method in Python for columns in pandas data frames, that doesn't require looping through the values.

M--
  • 25,431
  • 8
  • 61
  • 93
leiberl
  • 73
  • 1
  • 5

5 Answers5

6

Here is a one liner:

B.reset_index().groupby('C')['index'].first()[A.C].values

This solution returns the results in the same order as the input A, as match does in R.


Full example:

import pandas as pd

A = pd.DataFrame({'C':['a','b']})
B = pd.DataFrame({'C':['c','c','b','b','c','b','a','a']})

B.reset_index().groupby('C')['index'].first()[A.C].values
Output array([6, 2])

Edit (2023-04-12): In newer versions of pandas .loc matches all rows that match the condition. Thus, the previous solution (B.reset_index().set_index('c').loc[A.c, 'index'].values) would return all the matches instead of only the first ones.

toto_tico
  • 17,977
  • 9
  • 97
  • 116
  • Sadly this is criminally slower than R's match() for large vectors, e.g., 1e7, as of 2023. Python appears to have no truly equivalent implementation of R's match(), which has been balletically optimized in C.. – user2961927 Apr 11 '23 at 03:41
  • 1
    if you are interested in performance, there are quite a few options in Python: https://stackoverflow.com/questions/7632963/numpy-find-first-index-of-value-fast. I updated the answer, as pandas has evolved since I originally posted. – toto_tico Apr 12 '23 at 07:47
  • Thanks! Coded 4 approaches. The fastest still seems 1.5x slower than R's match()..: https://stackoverflow.com/a/76030311/2961927 – user2961927 Apr 16 '23 at 21:06
4

You can use first drop_duplicates and then boolean indexing with isin or merge.

Python counts from 0, so for same output add 1.

A = pd.DataFrame({'c':['a','b']})
B = pd.DataFrame({'c':['c','c','b','b','c','b','a','a']})


B = B.drop_duplicates('c')
print (B)
   c
0  c
2  b
6  a

print (B[B.c.isin(A.c)])
   c
2  b
6  a

print (B[B.c.isin(A.c)].index)
Int64Index([2, 6], dtype='int64')

print (pd.merge(B.reset_index(), A))
   index  c
0      2  b
1      6  a

print (pd.merge(B.reset_index(), A)['index'])
0    2
1    6
Name: index, dtype: int64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 2
    Notice that the results here is return in a different order. In `R`, the order of the output of `match` corresponds to the input: `('a', 'b')` corresponds to `c(7,3)`, but the answer here returns the result for `b` first, and then `a`. See my answer for a one-liner option that respect the order. – toto_tico Apr 23 '20 at 13:26
2

This gives all the indices that are matched (with python's 0 based indexing):

import pandas as pd

df1 = pd.DataFrame({'C': ['a','b']})
print df1

   C
0  a
1  b

df2 = pd.DataFrame({'C': ['c','c','b','b','c','b','a','a']})
print df2   

   C
0  c
1  c
2  b
3  b
4  c
5  b
6  a
7  a

match = df2['C'].isin(df1['C'])
print [i for i in range(match.shape[0]) if match[i]]

#[2, 3, 5, 6, 7]
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
1

Here is an implementation with various approaches. Without resorting to C or C++, the fastest method seems to be datatable:

def match(x, y, method = "dt"):
  '''
  x and y are two numpy 1d arrays containing only finite values.  
  
  method = 'dt': use datatable
  method = 'pandas': use pandas
  method = 'numpy': use numpy
  method = 'dict': use hashing.
  '''
  if method == 'dt': # Use datatable
    xdf = datatable.Frame({'val': x})
    ydf = datatable.Frame({'val': y, 'ind': np.arange(y.shape[0]) })[
      :, datatable.min(datatable.f.ind), datatable.by(datatable.f.val)]
    ydf.key = 'val'
    rst = xdf[:, :, datatable.join(ydf)]['ind'].to_numpy()
    return rst.filled(-1 - y.shape[0]).ravel()
  
  
  if method == 'pandas': # Use pandas dataframe.
    xdf = pd.DataFrame({'val': x})
    ydf = pd.DataFrame({'val': y, 'ind': np.arange(y.shape[0]) }).groupby(
      ['val']).min()
    joined = xdf.join(ydf, on = 'val', lsuffix = '_x', rsuffix = '_y')
    rst = joined['ind'].to_numpy()
    rst[np.isnan(rst)] = -1 - y.shape[0]
    return rst.astype(int)
  
  
  rst = np.zeros(x.shape[0], dtype = np.int32) - (y.shape[0] + 1)
  if method == 'numpy':
    yorder = y.argsort()
    ysorted = y[yorder]
    ind = np.searchsorted(ysorted, x)
    outofBound = ind >= y.shape[0]
    ind[outofBound] = 0
    eq = ysorted[ind] == x
    eq[outofBound] = False
    rst[eq] = yorder[ind][eq]

  else: # Hashing.

    D = dict(zip(y[::-1], np.arange(y.shape[0] - 1, -1, -1)))
    for i, u in enumerate(x):
      val = D.get(u)
      if val is not None: rst[i] = val
  return rst

Test code:

  import datatable
  import pandas
  import time
  import numpy as np
  
  
  N = int(1e9)
  k = int(1e7)
  x = np.random.choice(N, k)
  y = np.random.choice(N, k)
  timeCosts = {}
  
  
  st = time.time()
  ind = match(x, y, "dt")
  timeCosts['datatable'] = time.time() - st
  np.all(x[ind >= 0] == y[ind[ind >= 0]])
  
  
  st = time.time()
  ind = match(x, y, "pandas")
  timeCosts['pandas'] = time.time() - st
  np.all(x[ind >= 0] == y[ind[ind >= 0]])
  
  
  st = time.time()
  ind = match(x, y, "numpy")
  timeCosts['numpy'] = time.time() - st
  np.all(x[ind >= 0] == y[ind[ind >= 0]])
  
  
  st = time.time()
  ind = match(x, y, "hashing")
  timeCosts['hashing'] = time.time() - st
  np.all(x[ind >= 0] == y[ind[ind >= 0]])

The time costs in seconds: {'datatable': 1.55, 'pandas': 8.01, 'numpy': 14.91, 'hashing': 6.04}

But the fastest is still slower than R's match: 1.05s

R must have used some hashing technique similar to that in radix sort..

user2961927
  • 1,290
  • 1
  • 14
  • 22
0

I defined a python match function that takes lists or 1D np arrays as parameters, and then use it with columns from the pandas dataframes.

import numpy as np
import pandas as pd 

    
def match(needles,haystack): # parameters are python lists or 1 dimensional np arrays.
    # Copyleft 2023 David A. Kra  Creative Commons License: BY-SA Attribution-ShareAlike 
    # Derived from method3 in https://www.statology.org/numpy-find-index-of-value/ 
    # which is Copyright 2021 Zach Bobbitt# match(needles,haystack) returns indices to use
        #find index location of first occurrence of each value of interest
        # returns an np 1 dimensional array of indices. notfound values get None
        # tresult=np.array(needles.size*[2*1024*1024*1024], dtype='i4', copy=True,) # dtype 'intp' is for indexing, as an alternative to 'i4' or 'int32'
        sorter = np.argsort( nphaystack:=np.array(haystack) )
        tresult=sorter[np.searchsorted(nphaystack, (npneedles:=np.array(needles)), sorter=sorter)]
            # if a needle value is greater than any value in in the haystack, will throw an IndexError, such as 
            #   IndexError: index 6 is out of bounds for axis 0 with size 6
        # If a needle is missing from the haystack, the index returned is of the next higher needle that is in the haystack. 
        #    Instead, turn it into None.
        return [tresult[i]  if nphaystack[ tresult[i] ] == npneedles[i] else None for i in range(tresult.size) ]

# usage:

npHaystackID=pdHaystack['ID_BUC'].to_numpy()  
npNeedlesID=pdNeedles['ID_BUC'].to_numpy() 

rowsOfNeedlesInBUC=match(npNeedlesID,npHaystackID)
user15972
  • 124
  • 4