
I would like to get the result below from these two DataFrames:

import pandas as pd

df1 = pd.DataFrame({'url': [
  'http://google.com/men', 
  'http://google.com/women', 
  'http://google.com/men-shoes',
  'http://google.com/women-shoes',
  'http://google.com/not-important',
], 'click': [3, 4, 6, 5, 8]})

df2 = pd.DataFrame({'keyword': ['men','women','shoes', 'kids']})

Result:

  keyword  instances  clicks
0     men          2     9.0
1   women          2     9.0
2   shoes          2    11.0
3    kids          0     0.0

That is: count how many times each df2 keyword appears in the df1 url column, then sum the click values of the matching df1 rows.

I am struggling to get this result, thanks.
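For reference, here is a minimal non-fuzzy sketch of that computation, assuming a whole-token match per keyword is what is wanted (a plain substring test would wrongly count `men` inside `women`):

import re
import pandas as pd

rows = []
for kw in df2['keyword']:
    # word-boundary regex so 'men' does not match inside 'women'
    hits = df1['url'].str.contains(rf'\b{re.escape(kw)}\b')
    rows.append({'keyword': kw,
                 'instances': hits.sum(),
                 'clicks': df1.loc[hits, 'click'].sum()})
print(pd.DataFrame(rows))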

Álvaro

2 Answers


You can use the fuzzy_merge function I wrote (reproduced below from the linked answer); combining it with explode and groupby gets quite close to your result. Note this is still fuzzy matching, which is why the numbers differ slightly.

You can try to play with the threshold argument to get your desired result:

mrg = (
    fuzzy_merge(df1, df2, 'url', 'keyword')        # adds a 'matches' list column to df1
     .explode('matches')                           # one row per (url, matched keyword)
     .groupby('matches').agg({'matches': 'size',   # instances per keyword
                              'click': 'sum'})     # total clicks per keyword
)

df2['instances'] = df2['keyword'].map(mrg['matches']).fillna(0)
df2['clicks'] = df2['keyword'].map(mrg['click']).fillna(0)

  keyword  instances  clicks
0     men        2.0     7.0
1   women        2.0     9.0
2   shoes        2.0    11.0
3    kids        0.0     0.0

The fuzzy_merge function used, from the linked answer:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the amount of matches that will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: [i[0] for i in x if i[1] >= threshold])
    df_1['matches'] = m2

    return df_1
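
For illustration, process.extract returns a list of (match, score) tuples with scores from 0 to 100, which is what the i[1] >= threshold filter above checks (the output shown is indicative only; exact scores depend on fuzzywuzzy's default scorer):

from fuzzywuzzy import process

# returns [(match, score), ...], best matches first
print(process.extract('http://google.com/men-shoes',
                      ['men', 'women', 'shoes', 'kids'], limit=2))
# e.g. [('shoes', 90), ('men', 90)]  -- illustrative scores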
Erfan

You can try this: it extracts the last part of the URL after the `/` and splits it on `-` (that may be enough for your case):

# extract the slug after the last '/' and split it into keyword tokens
df1['keyword'] = df1['url'].str.extract(r'/([^/]+?)$')[0].str.split('-')
# a right merge keeps keywords with no match at all (e.g. 'kids')
print( pd.merge(df1.explode('keyword'), df2, how='right')
         .groupby('keyword').agg({'click': 'sum', 'url': lambda x: x[~x.isna()].count()})
         .rename(columns={'click': 'clicks', 'url': 'instances'})
         .reset_index() )

Prints:

  keyword  clicks  instances
0    kids     0.0          0
1     men     9.0          2
2   shoes    11.0          2
3   women     9.0          2
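
For what it's worth, the intermediate keyword column that the extract/split step produces on the sample data looks like this (a sketch; display alignment approximate):

print(df1[['url', 'keyword']])
#                                url           keyword
# 0            http://google.com/men             [men]
# 1          http://google.com/women           [women]
# 2      http://google.com/men-shoes      [men, shoes]
# 3    http://google.com/women-shoes    [women, shoes]
# 4  http://google.com/not-important  [not, important]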
Andrej Kesely
  • Actually I had to change the `agg` to `value: lambda x: x.astype(float).sum()`, as it was concatenating the numbers as strings – Álvaro Jan 20 '20 at 18:08
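
Along the lines of that comment, if `click` arrives as strings, a hedged alternative is to coerce the column once up front so that `'sum'` adds numbers rather than concatenating them:

# hypothetical fix: make 'click' numeric before the groupby/agg
df1['click'] = pd.to_numeric(df1['click'], errors='coerce')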