TL;DR: I'm trying to change the FQDN of a URL but keep the port, using Python's re.sub.
Example Input:
http://www.yahoo.com:80/news.html
http://new.news.com/news.html
https://www.ya.com:443/new.html
https://www.yahoots.com/new.html
Example Output:
http://www.google.com:80/news.html
http://www.google.com/news.html
https://www.google.com:443/new.html
https://www.google.com/new.html
And here's the sed command that produces that output:
sed -e 's|//[^:]*\(:[0-9]*\)*/|//www.google.com\1/|' < input
That seems to work fine. In short, I'm looking to replace everything between the // and the next /, but I want to keep the port (if specified) intact.
However, the Python version doesn't work so well:
re.sub( '//.*(:[0-9]*)*/' , '//' + 'www.google.com\\1' + '/' , 'http://www.yahoo.com/news.m3u8' )
Yields:
sre_constants.error: unmatched group
However it works if the port is present:
re.sub( '//.*(:[0-9]*)*/' , '//' + 'www.google.com\\1' + '/' , 'http://www.yahoo.com:80/news.m3u8' )
This should be simple, but I figured it might spark a useful discussion about how the regex dialects of sed and Python differ. At the very least, someone smarter than me can tell me what I'm doing wrong. I've considered avoiding the problem entirely by restructuring the program or using a URL parsing library, but I am curious about what is going on with Python's regex handling. I am also worried that (: has some special meaning to Python's re library.
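One possible workaround, as a minimal sketch. It assumes the error arises because group 1 takes no part in the match when the URL has no port, so \1 has nothing to refer to. Here the optional part is moved inside the capture group, so the group always matches (possibly as the empty string). The name replace_host and its default host are made up for illustration. As far as I know, a plain (: is just a group that starts with a literal colon; it is (?: that Python treats specially, as a non-capturing group.

```python
import re

# Hypothetical helper; the name and the default host are illustrative only.
# The port is optional *inside* group 1, so the group always participates
# in the match and \1 is either ":<digits>" or the empty string.
_HOST_RE = re.compile(r'//[^/:]*((?::[0-9]+)?)/')


def replace_host(url, new_host='www.google.com'):
    # count=1 mirrors sed's s||| without the g flag: first match only.
    return _HOST_RE.sub('//' + new_host + r'\1/', url, count=1)


for url in (
    'http://www.yahoo.com:80/news.html',
    'http://new.news.com/news.html',
    'https://www.ya.com:443/new.html',
    'https://www.yahoots.com/new.html',
):
    print(replace_host(url))
```

The host part uses [^/:]* rather than .* so that it cannot run past the port or the first path slash, which is closer to what the sed expression does.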