I have two XLS sheets that I've read into two dataframes. I would like to find the intersection between the two, where the contents of one df column is contained in a specific column in the other df. And from that intersection, I'd like to create a new df that contains the intersection AND the values from a particular column in one of those dfs.
I cannot figure out what function in Pandas to get that intersection to work, and to include the data from the additional column.
This is what I tried to get the intersection:
import pandas as pd
test_file = '/Users/Bill/Box Sync/Documents/Jupyter/test_data.xlsx'
# read in the xls sheets
prj_df = pd.read_excel(test_file, sheet_name = 'Sheet1')
opp_df = pd.read_excel(test_file, sheet_name = 'Sheet2')
# intersect the sheets
report_df = opp_df[opp_df['opp'].isin(prj_df['opp numbers'])]
...but as you'll see below, report_df only includes the rows where the contents of "opp" and "opp numbers" match exactly. I need something that is more like the Python "in" keyword when looking for a string in a string.
Here is some test data:
prj_df:
Project opp numbers URL
0 Project 1 title ab-cdefg;12-34567 http://1.2.3.4/abc
1 project 2 title MJ-98733 http://1.2.3.4/UJUJUJUJ
2 Project 3 title No code http://1.2.3.4/99a
3 Project A title 01-PKL23 http://1.2.3.4/azzz
4 project B title 44-0098876 http://test.133
5 project c title 342-0981;98-09913 http://2.3.4.5/iiihh
6 project 99 title 25-AAAJJ12;99-49494 http://1.2.3.4/ghhi
7 project 303 title 77-AUDIJJ http://1.2.3.4/def
opp_df:
0 opp product price
1 98-09913 widget1 123.55
2 66-99999 widget2 44.99
3 AB-DEFFF1 widget3 2345.5
4 01-PKL23 item a 9990
5 77-AUDIJJ item b 84.56
6 KK-KIEW89 product 99 22.99
7 WE-24422 name 123 1.99
8 12-34567 stuff 1 395.5
report_df:
opp product price
3 01-PKL23 item a 9990.00
4 77-AUDIJJ item b 84.56
....but that data in report_df isn't complete: I also need row 1 and row 8 to show up because the 'opp' values from those rows are also contained in prj_df.
Now, in addition to getting the intersection correct, what I'd like to do is to add the values from the URL and Project columns of prj_df onto the end of report_df. Since I've been using the openpyxl library, my brain is only tuned to iterating through the sheet/dataframe and searching prj_df for the values instead of using the isin intersection I already did.
Is there a way to a) get that intersection to work, and b) pull in the Project and URL values into that intersected dataframe?