I want to use the hostnames of all the URLs in my dataset as features, but I can't figure out how to get TfidfVectorizer to extract only the hostname from each URL and compute its weight. For instance, I have a dataframe df whose 'url' column holds all the URLs I need. I thought I had to do something like:
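from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(t):
    # Keep only the hostname part of the URL
    return urlparse(t).hostname
tfv = TfidfVectorizer(preprocessor=preprocess)
tfv.fit_transform(df['url'])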
This doesn't work: the hostnames get split into pieces instead of being treated as whole strings. I think it's because of analyzer='word' (the default), which tokenizes the preprocessed string into words, so the dots in a hostname act as separators.
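To make the behaviour concrete, here is a minimal sketch with made-up URLs. The list `urls` and the helper names `hostname` and `hostname_analyzer` are my own placeholders, not part of the real dataset. The first vectorizer reproduces the splitting. The second passes a callable as `analyzer`, which, as far as I can tell, makes each hostname a single token, though I'm not sure it's the recommended way.

```python
from urllib.parse import urlparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical URLs standing in for df['url'].
urls = [
    "http://www.example.com/a/b",
    "https://news.example.org/story?id=1",
    "http://www.example.com/other",
]

def hostname(url):
    # Return the hostname part of the URL, or an empty string if there is none.
    return urlparse(url).hostname or ""

# Default analyzer='word': the preprocessed hostname is still tokenized
# by the default token_pattern, so the dots act as separators.
split_tfv = TfidfVectorizer(preprocessor=hostname)
split_tfv.fit(urls)
split_vocab = sorted(split_tfv.vocabulary_)
print(split_vocab)  # ['com', 'example', 'news', 'org', 'www']

def hostname_analyzer(url):
    # A callable analyzer returns the final list of tokens for one document;
    # here that is a single token: the whole hostname.
    host = urlparse(url).hostname
    return [host] if host else []

whole_tfv = TfidfVectorizer(analyzer=hostname_analyzer)
weights = whole_tfv.fit_transform(urls)
whole_vocab = sorted(whole_tfv.vocabulary_)
print(whole_vocab)  # ['news.example.org', 'www.example.com']
```

With the callable analyzer, `weights` has one column per distinct hostname. Note that each row then contains at most one non-zero entry, so after the default L2 normalization that entry is always 1.0 and the row is effectively a one-hot encoding of the hostname.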
Any help would be appreciated, thanks!