I'm writing a simple Python script so I can test my websites from a different IP address.

The URL of a page is given in the query string; the script fetches the page and displays it to the user. The code below rewrites the tags that contain URLs, but I don't think it's complete or entirely correct.

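import re
import urllib
import urlparse

# loc (URL of the fetched page) and proxy_script_url are defined elsewhere in the script.
def rel2abs(rel_url, base=loc):
    return urlparse.urljoin(base, rel_url)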

def is_proxy_else_abs(tag, attr):
    if tag in ('a',):
        return True
    if tag in ('form', 'img', 'link') and attr in ('href', 'src', 'action', 'background'):
        return False

def repl(matchobj):
    if is_proxy_else_abs(matchobj.group(1).lower(), matchobj.group(3).lower()):
        return r'<%s %s %s="http://%s?%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), proxy_script_url, urllib.urlencode({'loc':rel2abs(matchobj.group(5))}))
    else:
        return r'<%s %s %s="%s" ' %(matchobj.group(1), matchobj.group(2), matchobj.group(3), rel2abs(matchobj.group(5)))

def fix_urls(page):
    get_link_re = re.compile(r"""<(a|form|img|link) ([^>]*?)(href|src|action|background)\s*=\s*("|'?)([^>]*?)\4""", re.I|re.DOTALL)
    page = get_link_re.sub(repl, page)
    return page

The idea is that the href attributes of 'a' tags should be routed through the proxy script, but CSS, JavaScript, images, forms, etc. should not be, so their URLs have to be made absolute if they are relative in the original page.
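
To make the two cases concrete, here is a minimal sketch of the URL transformations on their own, without any HTML handling. It uses the Python 3 names (`urllib.parse`) rather than the Python 2 modules in the code above, and the values of `loc` and `proxy_script_url` are made-up placeholders (here the proxy URL includes its scheme). `via_proxy` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urljoin

loc = "http://example.com/dir/page.html"                 # hypothetical page being fetched
proxy_script_url = "http://proxy.example.org/fetch.py"   # hypothetical proxy script


def rel2abs(rel_url, base=loc):
    """Resolve a possibly relative URL against the page it came from."""
    return urljoin(base, rel_url)


def via_proxy(rel_url):
    """Make the URL absolute, then pass it to the proxy script as ?loc=..."""
    return proxy_script_url + "?" + urlencode({"loc": rel2abs(rel_url)})


# Images, stylesheets, form actions: only made absolute.
print(rel2abs("../img/logo.png"))
# http://example.com/img/logo.png

# Links: made absolute, then routed back through the proxy.
print(via_proxy("about.html"))
# http://proxy.example.org/fetch.py?loc=http%3A%2F%2Fexample.com%2Fdir%2Fabout.html
```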

The problem is that the code doesn't always work: CSS, for example, can be written in a number of ways. Is there a more comprehensive regex I can use?

  • Probably a silly question, but have you considered simply writing a REAL HTTP proxy? With a real proxy you shouldn't have to rewrite anything, since your browser will be explicitly configured to use it. It will generally work a lot better and be a lot easier to write. – Zoredache Dec 29 '08 at 20:07

1 Answer

Please read the other postings here about parsing HTML, for example "Python regular expression for HTML parsing (BeautifulSoup)" and "HTML parser in Python".

Use Beautiful Soup, not regular expressions.
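
As a rough illustration of the parser-based approach, the sketch below rewrites URL attributes while walking the parsed tags instead of matching them with a regex. To stay dependency-free it uses the standard library's `html.parser` in Python 3 rather than Beautiful Soup itself; with Beautiful Soup the same idea would be a loop over the tags that sets their attributes. The names `UrlRewriter`, `URL_ATTRS`, and the example URLs are assumptions made for this sketch, and it does not try to handle URLs inside inline CSS or scripts, nor does it special-case `mailto:` or `javascript:` links.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Attributes that may carry a URL; the list is taken from the regex in the question.
URL_ATTRS = {"href", "src", "action", "background"}


class UrlRewriter(HTMLParser):
    """Re-emit a page piece by piece, rewriting URL attributes on the way."""

    def __init__(self, base, proxy_url):
        # convert_charrefs=False leaves entities such as &amp; in text untouched.
        super().__init__(convert_charrefs=False)
        self.base = base
        self.proxy_url = proxy_url
        self.out = []

    def _rewrite(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:            # bare attribute, e.g. <input disabled>
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base, value)
                if tag == "a":           # only links go back through the proxy
                    value = self.proxy_url + "?" + urlencode({"loc": value})
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return " ".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append("<%s>" % self._rewrite(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append("<%s />" % self._rewrite(tag, attrs))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def fix_urls(page, base, proxy_url):
    parser = UrlRewriter(base, proxy_url)
    parser.feed(page)
    parser.close()
    return "".join(parser.out)
```

Called as `fix_urls(page, "http://example.com/dir/page.html", "http://proxy.example.org/fetch.py")`, this would turn `<a href="about.html">` into a link through the proxy script and `<img src="../img/logo.png">` into an absolute image URL, whatever the quoting, attribute order, or letter case in the original markup.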

S.Lott