I have been using a regex that searches a document for all URLs and replaces them, but now I want to replace only the hostname (the domain), leaving the subdomain and every other part of the URL untouched.

For example, https://ftp.website.com should become https://ftp.mything.com

This is for a tool I am writing to sanitize documents, and I am fairly new to some of this. Any help would be greatly appreciated. Thanks!

This is my quick and dirty find and replace so far:

import fileinput
import re

for line in fileinput.input():
    line = re.sub(
        r'^(?:http:\/\/|www\.|https:\/\/)([^\/]+)',
        r'client.com', line.rstrip())
    line = re.sub(
        r'\b(\d{1,3}\.){2}\d{1,3}\b',
        r'1.33.7', line.rstrip())
    print(line)

I realize that urlparse can accomplish this, but I want the script to find the URLs in the document itself; I do not want to supply them. Maybe I just need help using a regex to find the URLs and then passing each one to urlparse to swap out the part I want. Hope this clarifies.

zek
  • This question is identical to [this](https://stackoverflow.com/questions/21628852/changing-hostname-in-a-url) – jeron Oct 06 '17 at 22:35
  • Possible duplicate of [Changing hostname in a url](https://stackoverflow.com/questions/21628852/changing-hostname-in-a-url) – Daniel Trugman Oct 06 '17 at 23:14
  • I do not want to specify a url, I want to search for all URLs in the document and just replace the domain. – zek Oct 06 '17 at 23:40

2 Answers

My solution below splits the URL into three groups: the part before the hostname, the hostname itself, and the part after it:

import re

# Group 1: scheme and any subdomains; group 2: the hostname; group 3: TLD and path.
# The pattern is anchored with ^ and $, so each target must be a bare URL.
regex = r"^(http[:\/\w\.]*[/.])(\w+)(\.[\w\/]+)$"

target = "http://olddomain.com"
print(re.sub(regex, r"\1newdomain\3", target))
# http://newdomain.com

target = "http://ftp.olddomain.com"
print(re.sub(regex, r"\1newdomain\3", target))
# http://ftp.newdomain.com

target = "https://sub.sub.olddomain.com/sub/sub"
print(re.sub(regex, r"\1newdomain\3", target))
# https://sub.sub.newdomain.com/sub/sub

# No scheme, so the pattern does not match and the string is left unchanged.
target = "how.about.this"
print(re.sub(regex, r"\1newdomain\3", target))
# how.about.this
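
Since the pattern above is anchored to a whole string, here is a rough sketch of how the same idea (keep what comes before the hostname, swap the hostname, keep what follows) might be applied to URLs sitting inside a line of text. The names `HOST_RE` and `replace_domain` are made up for this example, and it assumes a single-label suffix such as `.com`; a host ending in something like `.co.uk` would need extra handling.

```python
import re

# Hypothetical unanchored pattern, not taken from the answer above.
HOST_RE = re.compile(
    r"(https?://(?:[\w-]+\.)*?)"   # group 1: scheme plus any subdomain labels (lazy)
    r"[\w-]+"                      # the domain label that gets replaced
    r"(\.[A-Za-z]{2,})"            # group 2: single-label suffix such as .com
    r"(?![\w-]|\.[\w-])"           # the host must end here (no further label follows)
)

def replace_domain(text, new_domain="newdomain"):
    """Swap the domain label of every http(s) URL found in text."""
    # A function replacement avoids backslash surprises in new_domain.
    return HOST_RE.sub(lambda m: m.group(1) + new_domain + m.group(2), text)

print(replace_domain("see https://ftp.website.com/a/b and http://website.com."))
```

Because only the host part is matched, the path, query string, and any trailing punctuation are left exactly as they were.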
malioboro
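import fileinput
import re

# Group 1: everything up to and including "http://" and any leading "www.";
# group 2: the remaining dot-separated labels plus the path.
regex = r"(^.*http\://(?:www\.)*)\S+?((?:\.\S+?)*/.*$)"

for line in fileinput.input():
    # Strip the newline so print() does not emit an extra blank line.
    print(re.sub(regex, r"\1newdomain\2", line.rstrip("\n")))

# targets = ["http://olddomain.com/test/test", "this urel http://www.olddomain.com/test/test dends"]
#
# for target in targets:
#     print(re.sub(regex, r"\1newdomain\2", target))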

With the commented-out test lines enabled and the fileinput loop commented out, this gives the output below. I've left the fileinput version active so it works as requested.

python /tmp/test2.py
http://newdomain.com/test/test
this urel http://www.newdomain.com/test/test dends
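
The pattern above only looks for `http://`. For `https` URLs and hosts with subdomains, one possible alternative is what the question hints at: let a regex find each URL and let `urllib.parse` pick out the host. The sketch below is only an illustration; `URL_RE`, `swap_domain`, and `sanitize` are made-up names, and it assumes the label to replace is always the second-to-last one (so `.co.uk`-style suffixes are not handled).

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape: http(s) scheme, no whitespace or quotes, ending in a
# word character or slash so trailing sentence punctuation is left alone.
URL_RE = re.compile(r"https?://[^\s<>\"']*[\w/]")

def swap_domain(url, new_domain):
    """Replace the second-to-last host label; leave everything else as-is."""
    try:
        netloc = urlsplit(url).netloc
    except ValueError:  # e.g. malformed IPv6 brackets
        return url
    userinfo, at, hostport = netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    labels = host.split(".")
    # Skip single-label hosts (localhost) and numeric hosts (IPv4 addresses).
    if len(labels) < 2 or labels[-1].isdigit():
        return url
    labels[-2] = new_domain
    new_netloc = userinfo + at + ".".join(labels) + colon + port
    # Splice the new netloc into the original string so the path and
    # query are preserved exactly as written.
    return url.replace(netloc, new_netloc, 1)

def sanitize(text, new_domain="newdomain"):
    """Apply swap_domain to every URL that URL_RE finds in text."""
    return URL_RE.sub(lambda m: swap_domain(m.group(0), new_domain), text)

print(sanitize("this urel https://www.olddomain.com/test/test dends"))
```

In the `fileinput` loop this would be called as `print(sanitize(line.rstrip("\n")))`.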
Calvin Taylor