I have a string with multiple URLs and some text in between.

How can I replace each URL with its hostname and top-level domain?

Example Input: www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask

Desired Output: google.com some text google.com some text google.com some text stackoverflow.com

I've found the Python module tldextract, but it only extracts the hostname and TLD from a single URL; it doesn't help with finding and replacing every URL in a string.

Thanks in advance!

Tom

3 Answers

You can also use regex with the logic below:

  1. (https?://) --> Capture http:// or https://
  2. (www\.) --> Capture www.
  3. (?<=\.[a-z]{3})(/[^ ]*) --> Capture the path that follows a three-letter suffix such as .com, .org or .net, without capturing the suffix itself. Paths after suffixes of any other length (e.g. .io or .co.uk) are left in place.

import re

yourString = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

re.sub(r'(https?://)|(?<=\.[a-z]{3})(/[^ ]*)|(www\.)', '', yourString)

Out[1]: 'google.com some text google.com some text google.com some text stackoverflow.com'
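
If relying on a fixed suffix length feels brittle, one alternative is to match whole URL-like tokens and rebuild each one from its hostname. The following is only a minimal sketch using the standard library. URL_RE, host_only and shorten_urls are names made up for this illustration, and the pattern is a deliberately loose assumption about what counts as a URL (it would also match something like file.txt). It strips only a leading www. and keeps any other subdomain.

```python
import re
from urllib.parse import urlsplit

# Assumed, simplified URL pattern: optional scheme, dotted hostname, optional path.
URL_RE = re.compile(r'(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?', re.IGNORECASE)


def host_only(match):
    """Return the hostname of the matched URL without a leading 'www.'."""
    url = match.group(0)
    if '://' not in url:
        # A leading '//' makes urlsplit read the text as a network location.
        url = '//' + url
    host = urlsplit(url).hostname or match.group(0)
    return host[4:] if host.startswith('www.') else host


def shorten_urls(text):
    """Replace every URL-like token in text with its hostname."""
    return URL_RE.sub(host_only, text)
```

Calling shorten_urls(yourString) on the example string gives 'google.com some text google.com some text google.com some text stackoverflow.com'.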
realr

You could just strip prefixes such as 'www.' from the part before the domain, but that does nothing about whatever follows the suffix (the path), which can't be predicted.

Try this:

import tldextract

somestr = 'www.google.com some text google.com some text http://google.com some text https://stackoverflow.com/questions/ask'

words = []

for word in somestr.split(' '):
    extracted = tldextract.extract(word)
    # treat the word as a URL only if both a domain and a suffix were found
    if extracted.domain != '' and extracted.suffix != '':
        words.append(extracted.domain + '.' + extracted.suffix)
    else:
        words.append(word)

# join with single spaces so the result has no trailing space
newstr = ' '.join(words)

print(newstr)
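
The same word-by-word idea can be written with re.sub so that tabs, newlines and repeated spaces in the input survive unchanged. This is just a sketch: replace_tokens and to_registered_domain are hypothetical names, and the tldextract import is placed inside the function only so that the generic helper works without the package installed.

```python
import re


def replace_tokens(text, shorten):
    """Apply shorten() to each whitespace-delimited token, keeping the spacing."""
    # \S+ matches one run of non-whitespace, so the whitespace between
    # tokens is never touched by the substitution.
    return re.sub(r'\S+', lambda m: shorten(m.group(0)), text)


def to_registered_domain(token):
    """Return 'domain.suffix' if the token looks like a URL, else the token."""
    import tldextract  # third-party; imported lazily (an illustrative choice)

    parts = tldextract.extract(token)
    if parts.domain and parts.suffix:
        return parts.domain + '.' + parts.suffix
    return token
```

With tldextract installed, replace_tokens(somestr, to_registered_domain) applies the same rule as the loop above to every token.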
aybry

Here is another version that works on a pandas column, using "re" and "tldextract":

import re
import pandas as pd
import tldextract

#define the regex pattern to catch any url (try it on regex101.com)
ANY_URL_REGEX = re.compile(r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu
|za|zm|zw)\b/?(?!@)))""", re.VERBOSE)  # re.VERBOSE makes the regex ignore the line break inside the pattern

#you may want to lowercase the strings in your column
data['column'] = data['column'].str.lower()
#to simplify the process create 2 more columns
#1- 'url' holds the first full url found, for example xyz.co.uk or asd.am.edu or google.com
#2- 'domain' holds the domain of that full url

data['url'] = data['column'].str.extract(ANY_URL_REGEX, expand=False)
data['domain'] = data['url'].apply(lambda url: tldextract.extract(url).domain if pd.notnull(url) else '')

#now replace the "URL" found in each row with its "domain"; rows without a url are left unchanged
data['column'] = data.apply(lambda x: x['column'].replace(x['url'], x['domain']) if pd.notnull(x['url']) else x['column'], axis=1)

Note: this sample code extracts sample_site out of (http(s)://www.)sample_site.com/whatever. You can modify it to extract sample_site.com instead, for example by joining the domain and suffix fields that tldextract returns.

Ehsan