So this question concerns how to select a subset of rows in a data frame based on values in an array (or a single column). It is not enough for me to solve my problem.
I have many different tables in multiple directories. I have a dictionary with relations between tables (e.g keys for join). For each table T1, I lookup other tables (T2, T3...) that share same column names (keys) and I want to filter those tables (T2, T3...) to include rows that have matching key values in a set of columns with T1. Key set may vary! T1 may connect to T2 on one column (key) while T1 may connect with T2 on 5 keys! I do not know this beforehand.
So for example I have t1, t2, t3
and pks=["id"] (t1-->t2), fks=["id", "index", "zip"] (t1-->t3)
t1
id|index|zip|v
10|10000|200|20
t2
id|v
10|30
20|50
30|70
t3
id|index|zip|v
00|10000|200|10
10|10000|200|20
10|10000|300|30
10|10000|200|10
the output of t2 and t3 would be
t2
id|v
10|30
and t3
id|index|zip|v
10|10000|200|20
10|10000|200|10
Looking at the previous answer I would probably need to do smth like
filtered_t2 = t2.loc[t2[pks].isin(t1[fks])]
But i get the following error
ValueError: Cannot index with multidimensional key
Probably in this way I cannot handle compound key, but it also fails if I just provide one key -- 'id'! So maybe it cannot accept an array as values ...
How do I handle it when pks
and fks
are arrays of variable sizes?
Would this be a correct approach:
filter = None
for p, f in zip(pks, fks):
if filter is None:
filter = t2[p].isin(t1[f])
else:
filter &= t2[p].isin(t1[f])
filtered_ft = t2.loc[filter]
Thanks!