
This question covers how to select a subset of rows in a data frame based on values in an array (or a single column), but that is not enough to solve my problem.

I have many different tables in multiple directories, and a dictionary describing the relations between tables (e.g. the keys for a join). For each table T1, I look up the other tables (T2, T3...) that share the same column names (keys), and I want to filter those tables (T2, T3...) so that they include only rows whose key values match T1 across the whole set of key columns. The key set may vary! T1 may connect to one table on a single column (key) and to another on 5 keys! I do not know this beforehand.

For example, I have t1, t2, t3 with pks=["id"] (t1-->t2) and fks=["id", "index", "zip"] (t1-->t3):

t1
id|index|zip|v
10|10000|200|20

t2
id|v
10|30
20|50
30|70

t3
id|index|zip|v
00|10000|200|10
10|10000|200|20
10|10000|300|30
10|10000|200|10

The expected output for t2 and t3 would be

t2
id|v
10|30

and t3

id|index|zip|v
10|10000|200|20
10|10000|200|10

Looking at the previous answer, I would probably need to do something like

filtered_t2 = t2.loc[t2[pks].isin(t1[fks])]

But I get the following error:

ValueError: Cannot index with multidimensional key

This approach probably cannot handle a compound key, but it also fails if I provide just one key, 'id'! So maybe it cannot accept an array as values...

How do I handle it when pks and fks are arrays of variable sizes?

Would this be a correct approach:

    mask = None
    for p, f in zip(pks, fks):
        # AND together one isin() test per key column pair
        current = t2[p].isin(t1[f])
        mask = current if mask is None else mask & current

    filtered_ft = t2.loc[mask]

Thanks!

YohanRoth

1 Answer


Let us try merge here:

t2.merge(t1,how='inner',on=['id'])

t3.merge(t1,how='inner',on=['id','index','zip'])
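
If the goal is only to filter the second table, one possible variation on the merge above is to pass just the key columns of t1, de-duplicated, so that the inner merge behaves like a filter. This is a minimal sketch, not part of the original answer: the frames are rebuilt from the example in the question with ids written as plain integers, and the name `keys` is an assumption introduced here.

```python
import pandas as pd

# Example frames rebuilt from the question (ids as plain integers).
t1 = pd.DataFrame({"id": [10], "index": [10000], "zip": [200], "v": [20]})
t3 = pd.DataFrame({
    "id": [0, 10, 10, 10],
    "index": [10000, 10000, 10000, 10000],
    "zip": [200, 200, 300, 200],
    "v": [10, 20, 30, 10],
})
keys = ["id", "index", "zip"]  # hypothetical name for the shared key columns

# Keep only the key columns of t1 and drop duplicate key rows, so the
# inner merge cannot add columns to t3 or multiply its rows.
filtered_t3 = t3.merge(t1[keys].drop_duplicates(), how="inner", on=keys)
print(filtered_t3)
```

Because the right-hand frame carries nothing but the keys, no columns of t1 need to be dropped or renamed afterwards. Note that the merge returns a fresh 0..n-1 index rather than the original row labels of t3.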

Another way:

t2[t2[pks].apply(tuple,1).isin(t1[pks].apply(tuple,1))]
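
The one-liner above could be wrapped in a small helper so that it works for a key list of any length. This is a sketch under assumptions: `filter_by_keys` and its parameter names are hypothetical, and the optional `source_keys` argument is added here only to cover the case where the key columns are named differently in the two tables.

```python
import pandas as pd

def filter_by_keys(target, source, target_keys, source_keys=None):
    """Return the rows of `target` whose key tuple also occurs in `source`."""
    source_keys = target_keys if source_keys is None else source_keys
    # One tuple per row, so a compound key is compared as a single unit.
    target_tuples = target[target_keys].apply(tuple, axis=1)
    source_tuples = source[source_keys].apply(tuple, axis=1)
    return target[target_tuples.isin(source_tuples)]

# Example frames rebuilt from the question (ids as plain integers).
t1 = pd.DataFrame({"id": [10], "index": [10000], "zip": [200], "v": [20]})
t2 = pd.DataFrame({"id": [10, 20, 30], "v": [30, 50, 70]})
t3 = pd.DataFrame({
    "id": [0, 10, 10, 10],
    "index": [10000, 10000, 10000, 10000],
    "zip": [200, 200, 300, 200],
    "v": [10, 20, 30, 10],
})

filtered_t2 = filter_by_keys(t2, t1, ["id"])
filtered_t3 = filter_by_keys(t3, t1, ["id", "index", "zip"])
```

Unlike merge, this keeps the original index and columns of the filtered table untouched.
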
BENY
  • But merge brings the two tables together, while I just want to filter the second table. I could probably drop the columns of t1, but then renaming might be a pain... Can you check my suggested approach in the edited version? Would something like this be correct? – YohanRoth Aug 09 '19 at 00:45
  • It's a bit hard to understand the logic flow. Can you briefly describe what it is doing? – YohanRoth Aug 09 '19 at 00:48
  • @YohanRoth convert the columns you need to a tuple (one per row), which allows us to use isin – BENY Aug 09 '19 at 00:49
  • @YohanRoth first select all the columns you need to check, then zip each row's values into a tuple; after that we can use isin to check whether it exists in the other data frame or not – BENY Aug 09 '19 at 01:05
  • I guess I am not super clear on why we need to convert it to a tuple... Do you think my solution in the post is also fine (except that it's longer and uglier)? Which do you think would be faster? – YohanRoth Aug 09 '19 at 02:27
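
On the question in the comments of why tuples are needed: a column-by-column isin test and a row-wise tuple test can give different results once the first table holds more than one key combination. The sketch below uses small hypothetical frames (not the data from the question) to show the difference; it says nothing about which approach is faster.

```python
import pandas as pd

# Hypothetical data: t1 holds two compound keys, (10, 200) and (20, 300).
t1 = pd.DataFrame({"id": [10, 20], "zip": [200, 300]})
t3 = pd.DataFrame({"id": [10, 10, 20], "zip": [200, 300, 300], "v": [1, 2, 3]})
keys = ["id", "zip"]

# Column-by-column test: each column is checked on its own, so the row
# (10, 300) passes even though that pair never occurs together in t1.
per_column = t3["id"].isin(t1["id"]) & t3["zip"].isin(t1["zip"])

# Row-wise test: the whole key tuple must occur in t1.
row_wise = t3[keys].apply(tuple, axis=1).isin(t1[keys].apply(tuple, axis=1))

print(t3[per_column]["v"].tolist())  # [1, 2, 3]
print(t3[row_wise]["v"].tolist())    # [1, 3]
```

With a single key column the two tests coincide; with a compound key, only the row-wise test matches the key as a unit.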