I have a dataframe where one column holds names and the other holds a list of URLs that are related to the name in some way (sample df):
Name Domain
'Apple Inc' 'https://mapquest.com/askjdnas387y1/apple-inc', 'https://linkedin.com/apple-inc/askjdnas387y1/', 'https://www.apple-inc.com/asdkjsad542/'
'Aperture Industries' 'https://www.cakewasdelicious.com/aperture/run-away/', 'https://aperture-incorporated.com/aperture/', 'https://www.buzzfeed.com/aperture/the-top-ten-most-evil-companies=will-shock-you/'
'Umbrella Corp' 'https://www.umbrella-corp.org/were-not-evil/', 'https://umbrella.org/experiment-death/', 'https://www.most-evil.org/umbrella-corps/'
I'm trying to find the URLs that have the keyword, or at least a partial match to the keyword, directly AFTER the scheme (with or without "www."), i.e. in either of these positions:
'https://NAME.whateverthispartdoesntmatter' # ...or...
'https://www.NAME.whateverthispartdoesntmatter' # <- not a real link
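To make "directly after" concrete, here is a minimal sketch of how that part of the URL could be isolated using only the standard library. The function name `leading_label` is hypothetical, and it assumes the only prefix to ignore is a leading "www.".

```python
from urllib.parse import urlsplit

def leading_label(url):
    """Return the first hostname label, ignoring a leading 'www.' (hypothetical helper)."""
    host = urlsplit(url).hostname or ""   # e.g. 'www.apple-inc.com'
    if host.startswith("www."):
        host = host[4:]                   # drop the 'www.' prefix
    return host.split(".")[0]             # keep only the part before the next dot

print(leading_label("https://www.apple-inc.com/asdkjsad542/"))        # apple-inc
print(leading_label("https://mapquest.com/askjdnas387y1/apple-inc"))  # mapquest
```

Only this label would then be compared against the name, so anything in the path is never considered.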
Right now I'm using the fuzzywuzzy package to get the partial matches:
fuzz.token_set_ratio(name, value)
It works great for partial matching, but the matches aren't position-dependent, so I'll get a perfect keyword match that is located somewhere in the middle of the URL, which isn't what I need, for example:
https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D
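A rough sketch of the position-dependent behaviour I'm after: score the name against the first hostname label only, so a match buried in the path cannot count. To keep it self-contained this uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.token_set_ratio` (the scores are not the same as fuzzywuzzy's), and `normalize` and `host_score` are hypothetical names.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

def normalize(text):
    # Lowercase and replace anything non-alphanumeric with a single space.
    return " ".join("".join(c if c.isalnum() else " " for c in text.lower()).split())

def host_score(name, url):
    """Similarity (0-100) between name and the first hostname label of url."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    label = host.split(".")[0]
    # Stand-in scorer; swap in a fuzzywuzzy ratio here if preferred.
    return round(100 * SequenceMatcher(None, normalize(name), normalize(label)).ratio())

urls = [
    "https://aperture-incorporated.com/aperture/",
    "https://www.bloomberg.com/profiles/companies/aperture-inc/0117091D",
]
for url in urls:
    print(host_score("Aperture Industries", url), url)
```

With this, the bloomberg URL is compared as "bloomberg" rather than as the whole string, so it scores lower than the URL whose hostname actually starts with the name. A cutoff would still have to be picked by hand.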