This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
The following python regular expression would locate these links for you:
r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'
The pattern we look for is something inside a href=""
chunk of text, where that 'something' starts with one or more instances of ../
, followed by external.html?link=
, then followed with any text that does not contain a "
quote.
The matched text after the equals sign is grouped in group 2 for easy retrieval, group 1 holds the ../../external.html?link=
part.
If all you want to do is remove the ../../external.html?link=
part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub()
on your HTML files:
import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
# ...
redirects.sub(r'href="\1"', somehtmlstring)
Note that this could also match any body text (so outside HTML tags), this is not a HTML-aware solution. Chances are there is no such body text though. But if there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.