Filtering a data frame with multiple column values of another data frame

Question

I have 2 data frames as below:

df1={"group":["A","B"],
 "unit":["U1","U2"],
 "char1":["C1","C2"],
 "char2":["Large","Medium"],
 "char3":["","R"]
}
df1=pd.DataFrame.from_dict(df1)

and

df2={"char1":["C1","C1","C1","C1","C2","C2","C2","C3","C3"],
"char2":["Large","Large","Large","Large","Medium","Medium","Medium","Medium","Large"],
 "char3":["U","U","U","R","R","R","U","R","R"],
 "result":[113,114,115,116,818,819,1101,1102,1103]}

df2=pd.DataFrame.from_dict(df2)

I want to filter df2 with the values of df1 columns(char1,char2,char3). Finally, for each group_unit pair in df1, I want to assign filtered results from df2.

The desired output would look like the following:

output={"group":["A","A","A","A","B","B"],
    "unit" :["U1","U1","U1","U1","U2","U2"],
    "result":[113,114,115,116,818,819]}
output=pd.DataFrame.from_dict(output)

I have tried using "isin" and some other stuff but couldn't manage to get there. I appreciate with any kind of soft coding solution.

your output is not correct if you want to "filter" on char1 char2 and char3 — It_is_Chris, Feb 02 '19 at 19:23
@Chris, you're right. I corrected the example output. Thanks for the warning. — nkltkf, Feb 02 '19 at 23:03

Valdi_Bo · Answer 1 · 2019-02-03T11:50:40.080

You can do this task with a single instruction:

pd.merge(df1, df2, on=['char1', 'char2'])[['group', 'unit', 'result']]

merge (not surprisingly) merges df1 and df2, but the result contains also char1, char2, char3_x and char3_y columns, so you need to "restrict" the output to the columns you want.

Note that the result column list is in double square brackets (you may say: a weird notation).

The reason is that:

"outer" pair is a "normal" pair for an index,
but because instead of a single column name we have a list of column names, it must be enclosed with another pair of square brackets.

And a little remark concerning your expected output:

The last "your" row (B, U2, 1102) should not actually be included. Note that the last but one row in df2 (C3, Medium, R, 1102) contains C3 as char1 and Medium as char2, but df1 does not contain any row containing these values.

Edit

You wrote that you want to filter (actually merge) on all three char1, char2 and char3 columns.

So probably:

Dictionary for df1 should containt 'char3':['U', 'R'] (note the added 'U').
The merge instruction should be: pd.merge(df1, df2) (with default merge criterion of all common columns), the "[[...]]" part as before.

Then the result would contain:

  group unit  result
0     A   U1     113
1     A   U1     114
2     A   U1     115
3     B   U2     818
4     B   U2     819
5     B   U2    1101

i.e. three rows containing C1, Large and U and another three rows containing C2, Medium and R.

Edit 2

In order to perform the filtering the way you explained in your comment, you have to split df1 into 2 DataFrames:

df1a containig an empty string in char3, dropping char3 column (the reason is explained below),
df1b containig not empty string in char3.

To do it run:

df1a = df1.query('char3.str.len() == 0').drop('char3', axis=1)
df1b = df1.query('char3.str.len() > 0')

Then you should concatenate two partial merges:

df1a with df, using the default merge criterion - all common columns (char1 and char2, so now it becomes clear why we dropped char3),
df1b with df, again - on all common columns (this time char1, char2 and char3),
leaving only group, unit and result columns.

To do it run:

pd.concat([pd.merge(df1a, df2), pd.merge(df1b, df2)])[['group', 'unit', 'result']]

you are right regarding your remark on the desired output. I corrected, thanks. I also made an edit on the df2 to be more explicit. When it comes to your remark on df1, I created 'char3':[' ', 'R'] on purpose. Because I want the code to bring all results under "char3" if the char value is blank/null. For example, A_U1 pair should get all possible results (113,114,115,116) since there is no "char3" criteria specified for this pair. One last point: You did not use any char3 in your solution. However, B_U2 pair suggests the "R" for "char3" criteria. How you would write the code considering those? — nkltkf, Feb 03 '19 at 00:24

Filtering a data frame with multiple column values of another data frame

1 Answers1

Edit

Edit 2