0

I am looking to optimize a piece of code that takes each row in one dataframe, and searches if those values exist in any row of a second dataframe. I then append the second dataframe rows to a list, followed by appending the matched rows of the first dataframe. The rows that are not found in the second dataframe should not be appended in the list, demonstrated by the code below.

I have written code to do this, but the problem is that it doesn't run fast enough with larger dataframes with millions of rows each. Is there any way to speed this up? I have tried incorporating itertuples and other methods to speed up the search but it still runs slow.

This is what the first dataframe looks like:

enter image description here

And this is what the second dataframe looks like:

enter image description here

The final dataframe that would be produced by the code below following the logic described would look like this. As you can see, the first dataframe at (column name) 'index' 1 shares no values with the second dataframe, and has no rows from the second dataframe in the index position underneath it, whereas (column name) 'index' 2 found a match with the last row in the second dataframe:

enter image description here

import pandas as pd 
import numpy as np

df1 = pd.DataFrame({"A": ["a", "b", "c", "d"], "B": ["e", "f", "g", "h"], "C": ["z", "l", "r", "s"], "index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"A": ["q", "r", "c", "b"], "B": ["a", "b", "c", "d"], "C": ["g", "g", "g", "g"]})

g = []
for row in df1.itertuples():
    g.append(df1.loc[[row.Index]])
    g.append(df2.loc[
        
        (df2.A.values == row.A) |
        (df2.B.values == row.B) |
        (df2.C.values == row.C)
        
    ])
pd.concat(g)
Fust
  • 19
  • 7
  • 1
    Was it your intention for ```df1``` to have an actual ```index``` of 1, 2, 3, 4? As written, ```"index"``` is the 4th column of the DF, (whose actual index is 0..3). If you want the _actual index_ to be 1..4, try ```df1 = pd.DataFrame({"A" : [...], "B" : [...], "C" : [...]}, index = [1,2,3,4]) – NateB Aug 07 '20 at 22:59
  • the index column in df1 is just a column to keep track of the records in df1, so that when they are appended to the final dataframe, they are identifiable. – Fust Aug 07 '20 at 23:04

1 Answers1

0

The following shows promise:

import pandas as pd 
import numpy as np

df1 = pd.DataFrame({"A": ["a", "b", "c", "d"], "B": ["e", "f", "g", "h"], "C": ["z", "l", "r", "s"], "index": [1, 2, 3, 4]})
df2 = pd.DataFrame({"A": ["q", "r", "c", "b"], "B": ["a", "b", "c", "d"], "C": ["g", "g", "g", "g"]})

combined_df = df1.append(df2[df2.A.isin(df1.A) | df2.B.isin(df1.B) | df2.C.isin(df1.C)])

   A  B  C  index
0  a  e  z    1.0
1  b  f  l    2.0
2  c  g  r    3.0
3  d  h  s    4.0
2  c  c  g    NaN
3  b  d  g    NaN

I have not put this to the test on a large DF, and your test data only contains matching chars in the "A" column for now, so you'll want to put it through its paces, but it might be helpful...

NateB
  • 509
  • 1
  • 3
  • 9
  • This doesn't accomplish the task as the "matched" rows from the second df do not show up immediately after the row with which it matched in df1 – Fust Aug 08 '20 at 01:15
  • See [sort on values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) , [sort by index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html), for tools to process the combined df, this includes means to sort by the index, and allows you to specify if NANs come first, or after, etc. – NateB Aug 08 '20 at 01:55