I have two dataframes with different sizes, df1
and df2
. I'm trying to check if the values from df1
exist in a column in df2
and return True
or False
in new columns in df1
.
The first dataframe is my reference. It's extracted from an xls file.
df1.head(10)
Out[29]:
PO Number Sales Document SO DO Document Number
0 3620556930 9001724124.0 4001458660.0 8001721322.0 1500017748
1 3620556930 9001723883.0 4001458865.0 8001721037.0 1500017540
2 3620556930 9001723884.0 4001459374.0 8001721038.0 1500017541
3 3620556930 9001723885.0 4001458101.0 8001721043.0 1500017542
4 3620547728 9001721907.0 4001457180.0 8001719172.0 1500015786
5 3620556930 9001721908.0 4001457724.0 8001719173.0 1500015787
6 TT030720 nan nan nan 700001897
7 3620518726 9600008914.0 5600008655.0 5600008655.0 1500008725
8 3620518726 9600008912.0 5600008653.0 5600008653.0 1500008723
9 3620518726 9600008913.0 5600008654.0 5600008654.0 1500008724
The second dataframe is from a table I scraped from a website.
df2.head(10)
Out[32]:
PO No Doc Type SUS Doc No GR_GA Inv_SO_DO Doc Date
0 3620556930 Purchase Order 8001294233 CSL 27.08.2020
1 3620556930 Goods Receipt 7903307400 Goods Received 4001457724 04.09.2020
2 3620556930 Goods Receipt 7903307457 Goods Accepted 4001457724 04.09.2020
3 3620556930 Payment Request 3102053949 CCM Invoice 9001721908 23.09.2020
4 3620556930 Goods Receipt 7903333326 Goods Received 4001458660 29.09.2020
5 3620556930 Goods Receipt 7903333325 Goods Received 4001458101 29.09.2020
6 3620556930 Goods Receipt 7903333322 Goods Received 4001458865 29.09.2020
7 3620556930 Goods Receipt 7903333327 Goods Accepted 4001458660 29.09.2020
8 3620556930 Goods Receipt 7903333324 Goods Received 4001458660 29.09.2020
9 3620556930 Goods Receipt 7903333329 Goods Accepted 4001458865 29.09.2020
My thought process in getting the output is as below:
- I'll create three additional columns in
df1
, nameddf1['GR', 'GA', 'Inv']
. - I'll use the values from
df1['SO']
anddf1['DO']
to check if they exist indf2['Inv_SO_DO']
. - If the values exist, I'll then check
df2['GR_GA']
if it's a Goods Receipt, Goods Acceptance or Invoice. I'll then returnTrue
orFalse
in columnsdf1['GR', 'GA', 'Inv']
depending on this check.
I've tried a for
loop as below for to create a list of values to be added for ['GA']
but it just gave me a list of Falses.
ga = []
t1 = x.iloc[:,2].values
t2 = y.iloc[:,4].values
t3 = y.iloc[:,3].values
for i in t1:
for j in t2:
for k in t3:
if i == j and k == 'Goods Receipt':
ga.append('True')
else:
ga.append('False')
The closest I got to a solution is from another question here. I tried the code and modified it but it didn't turn out right as well. Either that, or I'm doing the code from the link wrong.
Any advise would be most welcomed!
Output desired:
df1.head(4)
Out[43]:
PO Number Sales Document SO DO Document Number GR GA Inv
0 3620556930 9001724124.0 4001458660.0 8001721322.0 1500017748 True True True
1 3620556930 9001723883.0 4001458865.0 8001721037.0 1500017540 True False False
2 3620556930 9001723884.0 4001459374.0 8001721038.0 1500017541 False False False
3 3620556930 9001723885.0 4001458101.0 8001721043.0 1500017542 True True False