2

I have two dataframes, and I would Like to join df1 to df2 where df1 contains a url and df2 contains a list of urls.

The shape of df1 and df2 are different

Example:

df1 = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16']})
df2 = pd.DataFrame({'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/tokyo_kankan, http://www.example.jp/pro/briansawazakiphotography, http://www.example.jp/pro/r_masuda, http://www.example.jp/pro/sanada16, ......]']})

I would like the datafrmes to join on the condition that http://www.example.jp/pro/sanada16 in df1.url exists in df2.urls.

I thought about making the list in columns to columns, but the number of URLs is not unique in df2.urls.

I tried to add the df1.url substring that matched the df2.urls to a new column, so that I could join on the new column, but I couldn't get it to work:

df2['match'] = df2['urls'].apply(lambda x: x if x in df1['url'])

expected output:

new_df = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16'], 'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/tokyo_kankan, http://www.example.jp/pro/briansawazakiphotography, http://www.example.jp/pro/r_masuda, http://www.example.jp/pro/sanada16, ......]']})

With postgresql I could do:

SELECT 
    b.url
    ,a.urls
FROM df2 a
join df1 b 
    on position(b.url in a.urls)>0
kaws
  • 105
  • 9
  • 2
    can you create small sample dataframes and your expected output please? Check [how to create good pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – anky Mar 17 '19 at 16:59
  • thanks, added an image – kaws Mar 17 '19 at 17:32
  • Please post data and code as text, not images. – Peter Leimbigler Mar 17 '19 at 19:17
  • Hi, I have an working answer for you but I won't post it until you take the advice above, and provide short example input and output dataframes in a reproducible way, i.e. not as images – smj Mar 17 '19 at 19:58
  • 1
    Thanks guys, appreciate the education – kaws Mar 17 '19 at 20:15

1 Answers1

0

Here is one way, if I understand correctly. You can iterate over the patterns you want to search for, and then store the matches using df.at.

import pandas as pd

data_1 = pd.DataFrame(
    {
        'url': ['http://www.ex.jp', 'http://www.ex.com']
    }
)

data_2 = pd.DataFrame(
    {
        'url': ['http://www.ex.jp/pro', 'http://www.ex.jp/pro/test', 'http://www.ex.com/path', 'http://www.ex.com/home']
    }
)

result = pd.DataFrame(columns = ['pattern', 'matches'])

for i in range(data_1.shape[0]):

    result.loc[i, 'pattern'] = data_1.loc[i, 'url']

    result.at[i, 'matches'] = [j for j in data_2['url'] if data_1.loc[i, 'url'] in j]

print(result)

Gives:

             pattern                                            matches
0   http://www.ex.jp  [http://www.ex.jp/pro, http://www.ex.jp/pro/test]
1  http://www.ex.com   [http://www.ex.com/path, http://www.ex.com/home]

Kudos for updating your question as requested.

smj
  • 1,264
  • 1
  • 7
  • 14
  • Thank you for your answer. I need to be able to merge df1 and df2 because there are other columns that I need to add. This would create a new dataframe where df1.url matches result.pattern, but df2.urls does not match result.matches – kaws Mar 18 '19 at 22:53