TL;DR: I'm trying to change the FQDN of a URL but keep the port, using Python's re.sub.

Example Input:

http://www.yahoo.com:80/news.html
http://new.news.com/news.html
https://www.ya.com:443/new.html
https://www.yahoots.com/new.html

Example Output:

http://www.google.com:80/news.html
http://www.google.com/news.html
https://www.google.com:443/new.html
https://www.google.com/new.html

And here's the sed command that produces that output:

sed -e 's|//[^:]*\(:[0-9]*\)*/|//www.google.com\1/|'  < input

That seems to work fine. In short, I'm looking to replace everything between the // and the next /, but I want to keep the port (if specified) intact.

However, the Python version doesn't work so well:

re.sub( '//.*(:[0-9]*)*/' , '//' + 'www.google.com\\1' + '/' , 'http://www.yahoo.com/news.m3u8' )

Yields:

sre_constants.error: unmatched group

However it works if the port is present:

re.sub( '//.*(:[0-9]*)*/' , '//' + 'www.google.com\\1' + '/' , 'http://www.yahoo.com:80/news.m3u8' )

This should be simple, but I hoped it would spark a useful discussion about how sed and Python handle regular expressions differently. At the very least, someone smarter than me can tell me what I'm doing wrong. I've considered avoiding the problem entirely by restructuring the program or using a URL parsing library, but I am curious about what is going on with Python's regex. I am also worried that (: has some special meaning to Python's re library.

Mark

2 Answers

You need to use the right tool for the right job. urlparse is that tool.

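from urllib.parse import urlparse  # Python 3

url = 'http://www.yahoo.com:80/news.html'
parsed = urlparse(url)
# Keep the port only if the original URL specified one (parsed.port is None otherwise)
netloc = 'www.google.com' if parsed.port is None else '{}:{}'.format('www.google.com', parsed.port)
parsed = parsed._replace(netloc=netloc)  # Mark's edit
print(parsed.geturl())  # Mark's edit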

EDIT: Recently (July 6, 2023) I tried this code and found that the _replace() function returns a new ParseResult (leaving the current one unmodified). I've added an assignment to the line before the print to update this code. I've also added the () to print for Python 3. Maybe the new _replace() behavior is a Python 2/Python 3 difference as well (please excuse the lack of additional research I should have done).
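
A small check of the behavior described in the edit might look like the sketch below. It relies on ParseResult being a namedtuple subclass, so _replace() is the ordinary namedtuple method that builds a new tuple; the variable names `original` and `updated` are just for illustration.

```python
from urllib.parse import urlparse

original = urlparse('http://www.yahoo.com:80/news.html')

# _replace() does not modify `original`; it returns a new ParseResult
updated = original._replace(netloc='www.google.com:80')

print(original.geturl())            # still the yahoo URL
print(updated.geturl())             # the URL with the new host
print(isinstance(original, tuple))  # ParseResult is a tuple subclass
```

Since tuples are immutable, the result has to be assigned to a name before calling geturl() on it.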

Mark
eatmeimadanish

But if you are using Python 2 or just want to use a Regex:

import re

url = 'http://www.yahoo.com:80/news.html'
new_url = re.sub(r'(?<=://)(.*?)(?=[:/])', 'www.google.com', url)
print(new_url)  # parenthesized form works in both Python 2 and Python 3
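
For comparison, here is a sketch that stays closer to the sed expression from the question and runs it over the four example inputs. The helper name `replace_host` and its `new_host` parameter are hypothetical. Two things differ from the question's Python pattern: `[^:/]*` is used instead of the greedy `.*`, so the host part cannot swallow the port, and the optional port sits inside a group that always participates in the match (capturing an empty string when there is no port), which should avoid the "unmatched group" error on older Python versions.

```python
import re

def replace_host(url, new_host='www.google.com'):
    # '//' then the host (anything up to the first ':' or '/'),
    # then a group that captures ':<digits>' or the empty string, then '/'
    pattern = r'//[^:/]*((?::[0-9]+)?)/'
    # \1 re-inserts the captured port (possibly empty) after the new host
    return re.sub(pattern, '//' + new_host + r'\1/', url, count=1)

urls = [
    'http://www.yahoo.com:80/news.html',
    'http://new.news.com/news.html',
    'https://www.ya.com:443/new.html',
    'https://www.yahoots.com/new.html',
]

results = [replace_host(u) for u in urls]
for r in results:
    print(r)
```

Note that this sketch assumes the URL has a path; a URL with no / after the host will not match and is returned unchanged.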
Booboo