0

I have two XLS sheets that I've read into two dataframes. I would like to find the intersection between the two, where the contents of one df column is contained in a specific column in the other df. And from that intersection, I'd like to create a new df that contains the intersection AND the values from a particular column in one of those dfs.

I cannot figure out what function in Pandas to get that intersection to work, and to include the data from the additional column.

This is what I tried to get the intersection:

import pandas as pd
test_file = '/Users/Bill/Box Sync/Documents/Jupyter/test_data.xlsx'
# read in the xls sheets
prj_df = pd.read_excel(test_file, sheet_name = 'Sheet1')
opp_df = pd.read_excel(test_file, sheet_name = 'Sheet2')
# intersect the sheets
report_df = opp_df[opp_df['opp'].isin(prj_df['opp numbers'])]

...but as you'll see below, report_df only includes the rows where the contents of "opp" and "opp numbers" match exactly. I need something that is more like the Python "in" keyword when looking for a string in a string.

Here is some test data:

prj_df:

   Project            opp numbers          URL
0  Project 1 title    ab-cdefg;12-34567    http://1.2.3.4/abc
1  project 2 title    MJ-98733             http://1.2.3.4/UJUJUJUJ
2  Project 3 title    No code              http://1.2.3.4/99a
3  Project A title    01-PKL23             http://1.2.3.4/azzz
4  project B title    44-0098876           http://test.133
5  project c title    342-0981;98-09913    http://2.3.4.5/iiihh
6  project 99 title   25-AAAJJ12;99-49494  http://1.2.3.4/ghhi
7  project 303 title  77-AUDIJJ            http://1.2.3.4/def

opp_df:

0  opp        product     price
1  98-09913   widget1     123.55
2  66-99999   widget2     44.99
3  AB-DEFFF1  widget3     2345.5
4  01-PKL23   item a      9990
5  77-AUDIJJ  item b      84.56
6  KK-KIEW89  product 99  22.99
7  WE-24422   name 123    1.99
8  12-34567   stuff 1     395.5

report_df:

   opp        product     price
3  01-PKL23   item a      9990.00
4  77-AUDIJJ  item b      84.56

....but that data in report_df isn't complete: I also need row 1 and row 8 to show up because the 'opp' values from those rows are also contained in prj_df.

Now, in addition to getting the intersection correct, what I'd like to do is to add the values from the URL and Project columns of prj_df onto the end of report_df. Since I've been using the openpyxl library, my brain is only tuned to iterating through the sheet/dataframe and searching prj_df for the values instead of using the isin intersection I already did.

Is there a way to a) get that intersection to work, and b) pull in the Project and URL values into that intersected dataframe?

B. Seubert
  • 66
  • 4

1 Answers1

0

This is easier done with pandas merge function. But to do that you have to first split your values to rows using this function.

After that you can do an inner merge:

prj_df = explode_str(df, 'opp numbers', sep=';')
print(prj_df)

             Project opp numbers                      URL
0    Project 1 title    ab-cdefg       http://1.2.3.4/abc
0    Project 1 title    12-34567       http://1.2.3.4/abc
1    project 2 title    MJ-98733  http://1.2.3.4/UJUJUJUJ
2    Project 3 title     No code       http://1.2.3.4/99a
3    Project A title    01-PKL23      http://1.2.3.4/azzz
4    project B title  44-0098876          http://test.133
5    project c title    342-0981     http://2.3.4.5/iiihh
5    project c title    98-09913     http://2.3.4.5/iiihh
6   project 99 title  25-AAAJJ12      http://1.2.3.4/ghhi
6   project 99 title    99-49494      http://1.2.3.4/ghhi
7  project 303 title   77-AUDIJJ       http://1.2.3.4/def

Now we can merge:

report_df = opp_df.merge(prj_df[['opp numbers']], 
                         left_on='opp', 
                         right_on='opp numbers').drop('opp numbers', axis=1)

print(report_df)
         opp  product    price
0   98-09913  widget1   123.55
1   01-PKL23   item a  9990.00
2  77-AUDIJJ   item b    84.56
3   12-34567  stuff 1   395.50

Function used from linked answer

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Thanks, Erfan. I'm getting an error walkback from the arange call; the error says "error: nothing to repeat at position 0" out of the sre_parse.py module. Does this code expect that every value in the column will have that separator? I cut down the sheet to one row; the column value in my test is "09-ABCS1AR + 01-1234567", and that throws that error. – B. Seubert May 08 '19 at 19:03