
Basically, I have a dataframe where one column holds company names and the other holds URLs that are related to each name in some way (sample df):

   Name                    Domain
'Apple Inc'             'https://mapquest.com/askjdnas387y1/apple-inc', 'https://linkedin.com/apple-inc/askjdnas387y1/', 'https://www.apple-inc.com/asdkjsad542/'     
'Aperture Industries'   'https://www.cakewasdelicious.com/aperture/run-away/', 'https://aperture-incorporated.com/aperture/', 'https://www.buzzfeed.com/aperture/the-top-ten-most-evil-companies=will-shock-you/'
'Umbrella Corp'         'https://www.umbrella-corp.org/were-not-evil/', 'https://umbrella.org/experiment-death/', 'https://www.most-evil.org/umbrella-corps/'

I'm trying to find the URLs that have the keyword or at least a partial match to the keyword directly AFTER either:

'https://NAME.whateverthispartdoesntmatter'  # ...or...
'https://www.NAME.whateverthispartdoesntmatter' # <- not a real link
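
To make the "directly after" requirement concrete, anchoring the match at the start of the host can be sketched with plain `re` (a minimal sketch using only the standard library; `lead_token` is just an illustrative helper, not something I have working):

```python
import re

# Capture everything between "https://" (optionally "https://www.")
# and the next "." or "/" -- i.e. the first label of the host.
LEAD = re.compile(r'^https://(?:www\.)?([^./]+)')

def lead_token(url):
    """Return the host's first label, or None if the URL doesn't match."""
    m = LEAD.match(url)
    return m.group(1) if m else None

print(lead_token('https://www.apple-inc.com/asdkjsad542/'))        # apple-inc
print(lead_token('https://mapquest.com/askjdnas387y1/apple-inc'))  # mapquest
```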

Right now I'm using the fuzzywuzzy package to get the partial matches:

from fuzzywuzzy import fuzz

fuzz.token_set_ratio(name, value)

It works great for partial matching, but the matches aren't location dependent, so I'll get a perfect keyword match that's located somewhere in the middle of the URL, which isn't what I need, like:

https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D 
Sebastian Goslin
  • you can check if this helps: https://stackoverflow.com/questions/56521625/quicker-way-to-perform-fuzzy-string-match-in-pandas#56521804 – anky Jun 10 '19 at 18:30
  • You're making this more complicated than it needs to be. You simply want a regex with a capture group to capture the leading `https://(www\.)?`, then split on the following '.' or '-', then test with `.startswith` on the lead word. No need for fuzzywuzzy; it's overkill, it'll slow you down, and give false positives. – smci Jan 17 '21 at 18:51
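
A rough sketch of what this comment describes (my own reading of it, not smci's actual code; `lead_matches` is an invented helper):

```python
import re

def lead_matches(name, url):
    """Strip the leading https:// or https://www., split what follows on
    '.', '-' or '/', and test the first piece against the name's lead word."""
    host = re.sub(r'^https://(www\.)?', '', url)
    first = re.split(r'[.\-/]', host)[0].lower()   # e.g. 'apple' from 'apple-inc.com/...'
    lead = name.split()[0].lower()                 # e.g. 'apple' from 'Apple Inc'
    return first.startswith(lead)
```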

1 Answer

Using explode/unnest string, str.extract & fuzzywuzzy

First we will unnest your string to rows, using the explode_str function defined at the bottom of this answer:

df = explode_str(df, 'Domain', ',').reset_index(drop=True)
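
On pandas 0.25+ you can get the same unnesting without a helper, using Series.str.split plus DataFrame.explode (a sketch on a tiny made-up frame, assuming a plain ',' separator):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Apple Inc'],
    'Domain': ['https://mapquest.com/a/apple-inc,https://www.apple-inc.com/b/'],
})

# Split the comma-separated string into a list, then explode to one row per URL.
df = (df.assign(Domain=df['Domain'].str.split(','))
        .explode('Domain')
        .reset_index(drop=True))
print(df)
```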

Then we use a regular expression to match both patterns, with or without the www, and extract the name from each (note the raw string, the escaped dot in `www\.`, and the non-greedy `.*?` so we stop at the first dot):

m = df['Domain'].str.extract(r'https://www\.(.*?)\.|https://(.*?)\.')
df['M'] = m[0].fillna(m[1])
print(df)


                  Name                                             Domain                      M
0            Apple Inc       https://mapquest.com/askjdnas387y1/apple-inc               mapquest
1            Apple Inc      https://linkedin.com/apple-inc/askjdnas387y1/               linkedin
2            Apple Inc             https://www.apple-inc.com/asdkjsad542/              apple-inc
3  Aperture Industries  https://www.cakewasdelicious.com/aperture/run-...       cakewasdelicious
4  Aperture Industries        https://aperture-incorporated.com/aperture/  aperture-incorporated
5  Aperture Industries   https://www.buzzfeed.com/aperture/the-top-ten...               buzzfeed
6        Umbrella Corp       https://www.umbrella-corp.org/were-not-evil/          umbrella-corp
7        Umbrella Corp             https://umbrella.org/experiment-death/               umbrella
8        Umbrella Corp          https://www.most-evil.org/umbrella-corps/              most-evil

Then we use fuzzywuzzy to filter for the rows with a match score above 80:

from fuzzywuzzy import fuzz

m2 = df.apply(lambda x: fuzz.token_sort_ratio(x['Name'], x['M']), axis=1)

df[m2>80]


            Name                                        Domain              M
2      Apple Inc        https://www.apple-inc.com/asdkjsad542/      apple-inc
6  Umbrella Corp  https://www.umbrella-corp.org/were-not-evil/  umbrella-corp

Note: I used token_sort_ratio instead of token_set_ratio to catch the umbrella vs umbrella-corp difference (token_set_ratio would score both as perfect matches, since the token intersection is compared on its own).
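
If fuzzywuzzy isn't installed, difflib from the standard library gives a comparable 0-100 score; this is a rough stand-in for token_sort_ratio (sorting tokens before comparing), not a drop-in replacement:

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Rough stand-in for fuzz.token_sort_ratio: lowercase, treat '-' as a
    space, sort the tokens, rejoin, then compare with difflib (0-100)."""
    def norm(s):
        return ' '.join(sorted(s.lower().replace('-', ' ').split()))
    return round(SequenceMatcher(None, norm(a), norm(b)).ratio() * 100)

print(token_sort_score('Umbrella Corp', 'umbrella-corp'))  # identical after normalizing
print(token_sort_score('Umbrella Corp', 'umbrella'))       # drops below the 80 cutoff
```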


Function used from linked answer:

import numpy as np

def explode_str(df, col, sep):
    # Repeat each row's index once per separator-delimited piece, then
    # reassign the column to the flat list of split values.
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
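
The trick in explode_str is the repeated index: np.arange(len(s)).repeat(...) duplicates each row's position once per separator-delimited piece, so df.iloc[i] lines the other columns up with the flat list of split values. The mechanics in isolation:

```python
import numpy as np

s = ['a,b,c', 'd', 'e,f']                  # three "rows" with 3, 1 and 2 pieces
counts = [v.count(',') + 1 for v in s]     # pieces per row: [3, 1, 2]
i = np.arange(len(s)).repeat(counts)

print(i.tolist())              # [0, 0, 0, 1, 2, 2] -> row 0 used 3 times, row 2 twice
print(','.join(s).split(','))  # ['a', 'b', 'c', 'd', 'e', 'f']
```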
Erfan
  • This is perfect, and works, however I forgot to mention that the Domain column has at least 3 different domains within it, so this does a perfect job finding partial/exact matches, but it also keeps the other domains that accompany the exact matches – Sebastian Goslin Jun 10 '19 at 18:51
    Not sure if I understand you correctly, could you edit your question accordingly so I can see what you mean @SebastianGoslin – Erfan Jun 10 '19 at 18:53
  • Edited with new answer, please try to ask your question correctly in one go next time :) @SebastianGoslin – Erfan Jun 10 '19 at 19:31
  • quick question, how would I drop the rows that don't contain the keyword at the beginning, since they're still being added to the dataframe? – Sebastian Goslin Jun 10 '19 at 19:54
    `df = df[m2>80]` @SebastianGoslin – Erfan Jun 10 '19 at 19:58
  • super late question: I'm still running into issues where some of the retrieved domains are throwaways like `mapquest` or `linkedin`. Is there a way to modify this so that the matched word after `https://` or `https://www.` is only the name, and if not, to not include it? – Sebastian Goslin Jun 11 '19 at 15:07
    Sorry, but you cannot keep asking new questions in the comments. This one is answered. Feel free to post a new question and link it here, so I can help you @SebastianGoslin – Erfan Jun 11 '19 at 15:35