merging df in pandas based on "contains" values

Question

I have 2 dfs

df_1

Nº.do Incidente Status  Description Per_Extracao
0   IN6948271   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822...    DE : 2022/01/05 ATÉ : 2022/12/08
1   IN6948304   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ...    DE : 2022/01/05 ATÉ : 2022/12/08
2   IN6948307   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08
3   IN6948309   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08
4   IN6948310   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08
5   IN6948311   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX...   DE : 2022/01/05 ATÉ : 2022/12/08

df_2

    JOB_NAME    JOB_STREAM_NAME
0   PP_AACD_NR_D8706_TIHIBRIDA_PROC_EXCUC_D P26_AACD_FAC_TOD
1   PP_SASG_GD9822  P26_AACE_U08
2   PP_AACE_R4539   P26_AACE_U09
3   PP_AACE_R4539_CONS_JUNC P26_AACE_U08
4   PP_AACE_R4539_FMRC_TD_01    P26_AACE_U08
5   PP_AACE_R4539_FMRC_TD_02    P26_AACE_U08

I'm trying to merge then based on the value of JOB_NAME in df_2

the output should be something like this

merged_df

Nº.do Incidente Status  Description Per_Extracao JOB_NAME    JOB_STREAM_NAME
0   IN6948271   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822...    DE : 2022/01/05 ATÉ : 2022/12/08 PP_SASG_GD9822  P26_AACE_U08
1   IN6948304   ENCERRADO   GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ...    DE : 2022/01/05 ATÉ : 2022/12/08 PP_AACE_R4539   P26_AACE_U09

its not a regular join, its a contains contains condition("JOB_NAME" value in df_2 founded in "Description" column of df_1).

could you guys help me please?

Rabinzel · Accepted Answer · 2022-08-23T20:37:47.980

2

Based on this answer we can merge your dfs with some little changes.

pat = "|".join(map(str, df2['JOB_NAME']))
df1.insert(0, 'merge_key', df1['Description'].str.extract("("+pat+")", expand=False))

df = df1.merge(df2, left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1)
print(df)

       Nº.do Incidente Status                                  Description  jobName                      Per_Extracao        JOB_NAME JOB_STREAM_NAME
0  ENCERRADO             GR26  D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822...      NaN  DE : 2022/01/05 ATÉ : 2022/12/08  PP_SASG_GD9822    P26_AACE_U08
1  ENCERRADO             GR26  D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ...      NaN  DE : 2022/01/05 ATÉ : 2022/12/08   PP_AACE_R4539    P26_AACE_U09

Explanation:

The variable pat contains one big string with all jobnames we are searching for, seperated by |.

Then we insert a new column called merge_key at the 1st position (which doesn't really matter here, it could also be the last column) with str.extraxt, which searchs in the column df1['description'] in every row, if there is a match with any name of all job names in pat and if there is a match, it adds that match to the new column (we seperated them with | because in regex the | stands for or, the () are needed to make it caputered groups (see the documentation) for more details on regular expressions.

With the new column we can merge them two df on merge_key and JOB_NAME At the end we drop that column merge_key since we only needed it for the merging.

edited Aug 23 '22 at 20:37

answered Aug 22 '22 at 20:25

Rabinzel

7,757
3
10
30

hey @Rabinzel, thanks for your reply, I got the error: TypeError Traceback (most recent call last) in ----> 1 pat = "|".join(jobNameList_df_2['JOB_NAME']) 2 dados_df_2.insert(0, 'merge_key', dados_df_2['Breve Descrição'].str.extract("("+pat+")", expand=False)) 3 4 df_final = dados_df_2.merge(jobNameList_df_2, left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1) TypeError: sequence item 1353: expected str instance, float found – gfernandes Aug 23 '22 at 20:05
consider dados_df_2 in my comment as df_1 in my post jobNameList_df_2 as df_2 in my post column 'Breve Descrição' as description – gfernandes Aug 23 '22 at 20:06
I'll update my answer where all the values of the job name get joined together to the variable `pat`. See if it works now – Rabinzel Aug 23 '22 at 20:15
I believe I need to convert the JOB_NAME column to string : I'm trying the 2 options below and none are working: jobNameList_df_2['JOB_NAME'] = jobNameList_df_2['JOB_NAME'].astype(str) jobNameList_df_2['JOB_NAME'] = jobNameList_df_2['JOB_NAME'].map(str) – gfernandes Aug 23 '22 at 20:17
check my answer again. If that doesn't work, please print the whole column of job name and copy paste the data to your question with [edit](https://stackoverflow.com/posts/73450488/edit) – Rabinzel Aug 23 '22 at 20:19
that worked @Rabnzel, really thanks, but, just for me be able to understand, would you mind to explain what exactly the code is doing? I mean : pat = "|".join(map(str, df2['JOB_NAME'])) -> here, are you creating a list of the values present in the JOB_NAME column spplited by "|"? df1.insert(0, 'merge_key', df1['Description'].str.extract("("+pat+")", expand=False)) - and here? df = df1.merge(df2, left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1) - here you are doing the merge, what does mean the .drop('merge_key', axis=1)? – gfernandes Aug 23 '22 at 20:25
I updated my answer with an explanation. – Rabinzel Aug 23 '22 at 20:38

score -1 · Answer 2 · answered Aug 22 '22 at 20:38

The easiest way is:

- Imagine this is your first csv file ::

and this is second csv file ::

So as you can see these 2 data is like yours on pattern , there are only 2 rows on Sensor_2 column that are equal to the other file . ( check 1st and 2nd row on both tables)

now what we can do is :

import pandas as pd 

df1 = pd.read_excel("myData1.xlsx")
df1.head()

df2 = pd.read_excel("myData2.xlsx")
df2.head()

df_merged = df1.merge(df2,on=["Sensor_2"],suffixes=('', '_drop'))
df_merged.drop([col for col in df_merged.columns if 'drop' in col], axis=1, inplace=True)
print(df_merged.head())

and the result is :

   Sensor_1  Sensor_2  Sensor_3  Year
0        10         1       1.5  2020
1        11         2       2.0  2020

You did a nice effort explaining your approach, but that is not what the OP is searching for. In his data, there is no exact match between the 2 df's. He needs some way to check if a substring in one df is in the string on the other df. That's why `pd.merge` isn't possible right away. — Rabinzel, Aug 22 '22 at 20:47
as @Rabinzel said, I don't have exac match between the 1 dfs, this solution is good, but not what I'm looking for — gfernandes, Aug 23 '22 at 20:08

merging df in pandas based on "contains" values

2 Answers2

Linked