
I have a large df (df1) with binary outputs in each column like so:

df1:

  a b c d
1 1 0 1 0
2 0 0 0 0
3 0 1 0 1
4 1 1 0 0
5 1 0 0 0
6 1 0 1 1
...

I also have another, smaller df (df2) with some "template" rows, and I want to check whether df1's rows contain them. The templates look like this:

df2:

  a b c d
1 1 0 1 0
2 1 1 1 1
3 0 0 0 1
4 1 1 0 0

What I'm trying to do is search the large df efficiently for this small number of templates, so in this example rows 1, 3, 4 and 6 would match, but 2 and 5 would not. I want any row in the large df that contains a template but also has some extra 1s to still pass the test (i.e. all of the template's 1s are present, plus possibly some extra ones in that row).

I know that I could just have a nested loop and iterate over all the rows of the large and small dfs and match rows as np.arrays, but this seems like an extremely inefficient way to do this. I'm wondering if there are any non-iterative pd-based solutions to this problem?

Thank you so much!

Minor functionality edit: Along with searching and matching, I'm also trying to retain a list of which template row from df2 each row in df1 matched, so I can do statistics on how many templates show up in the large df and which ones they are. This is one of the reasons why this answer ([Compare Python Pandas DataFrames for matching rows](https://stackoverflow.com/questions/29464234/compare-python-pandas-dataframes-for-matching-rows)) doesn't work for me.
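
For concreteness, this is roughly the nested-loop baseline I mean (a minimal sketch, assuming df1 and df2 share the columns a–d), extended to also record which template each df1 row matched; it's exactly this kind of row-by-row looping I'd like to avoid:

import numpy as np

# For every row of df1, check every template row of df2.
# A row "contains" a template if it has a 1 everywhere the template has a 1
# (extra 1s in the row are fine).
matches = {}  # df1 index -> list of matching df2 template indices
for i, big_row in df1.iterrows():
    for j, template in df2.iterrows():
        if np.all(big_row.values >= template.values):
            matches.setdefault(i, []).append(j)

With the sample data above this gives matches = {1: [1], 3: [3], 4: [4], 6: [1, 3]}, i.e. rows 1, 3, 4 and 6 match, as described.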

Arkansin
  • Does this answer your question? [Compare Python Pandas DataFrames for matching rows](https://stackoverflow.com/questions/29464234/compare-python-pandas-dataframes-for-matching-rows) – Chris Mar 11 '20 at 21:52
  • @Chris, not exactly since I also want to know which template is showing up and how often. Also I want to retain rows which _contain_ the template but also have extra 1s... – Arkansin Mar 11 '20 at 21:54
  • Can you share sample data frames along with minimal code that you have tried so far? – Shadab Hussain Mar 11 '20 at 22:01

1 Answer


This logic will tell you where there are matches based on your requirements. Here I iterate over the template rows of df2 and keep a count of how many rows of df1 contain each template. You could instead create a third DF and keep adding a column to it for each template's test; then you can just sum up the columns/rows to get your totals:

df1

   a  b  c  d
1  1  0  1  0
2  0  0  0  0
3  0  1  0  1
4  1  1  0  0

df2

   a  b  c  d
1  1  0  1  0
2  1  1  1  1
3  0  0  0  1
4  1  1  0  0

Logic

t_match = []
for _, row in df2.iterrows():
    # Count the df1 rows that contain this template: subtracting the template
    # leaves no negative values only when every required 1 is present.
    t_match.append(((df1 - row) >= 0).all(axis=1).sum())

Output

t_match
[1, 0, 1, 1]
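
If you also want to keep track of which rows of df1 matched which template (the "third DF" idea mentioned above), one way, sketched here with the same df1/df2 as above, is to store each template's boolean mask as a column; column sums then give the count per template and row sums tell you how many templates each df1 row contains:

import pandas as pd

match_df = pd.DataFrame(index=df1.index)
for t_idx, row in df2.iterrows():
    # True where the df1 row has a 1 everywhere this template has a 1
    match_df[t_idx] = ((df1 - row) >= 0).all(axis=1)

per_template = match_df.sum(axis=0)  # same counts as t_match above
per_row = match_df.sum(axis=1)       # number of templates contained in each df1 row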
Chris
  • Thanks so much for this answer @Chris, but this is assuming both dfs are the same size. My df1 is much larger (~50k rows) than my df2 which has the template rows (~10 rows). I want to search for these 10 templates in the 50k rows of df1. – Arkansin Mar 11 '20 at 22:39
  • Updated to iterate over the template and look for any matches across the original df per template row, keeping a sum of the matches in a list – Chris Mar 11 '20 at 23:20
  • Thanks so much again @Chris. I think I understand how this works, except for what .all(axis=1) and sum() are doing. Would you mind explaining? – Arkansin Mar 11 '20 at 23:28
  • .all(axis=1) checks that every value in a row passes the >= 0 test, and .sum() then counts the rows where that's True, where True is 1 and False is 0 – Chris Mar 11 '20 at 23:29
  • Alright, I tried it and tested it with various values. 100% works and is a very elegant solution. – Arkansin Mar 11 '20 at 23:49