Python's urllib2 follows 3xx redirects to get the final content. Is there a way to make urllib2 (or some other library such as httplib2) also follow meta refreshes? Or do I need to parse the HTML manually for the refresh meta tags?
5 Answers
12
Here is a solution using BeautifulSoup and httplib2 (and certificate based authentication):
import BeautifulSoup
import httplib2

def meta_redirect(content):
    soup = BeautifulSoup.BeautifulSoup(content)
    result = soup.find("meta", attrs={"http-equiv": "Refresh"})
    if result:
        wait, text = result["content"].split(";")
        if text.strip().lower().startswith("url="):
            url = text.strip()[4:]
            return url
    return None
def get_content(url, key, cert):
    h = httplib2.Http(".cache")
    h.add_certificate(key, cert, "")
    resp, content = h.request(url, "GET")
    # follow the chain of meta-refresh redirects
    while meta_redirect(content):
        resp, content = h.request(meta_redirect(content), "GET")
    return content

asmaier
- `url=text[4:]` should be `url=text.strip()[4:]` to remove leading spaces. Note also that I sometimes see REFRESH instead of Refresh. – Marc Van Daele Nov 20 '22 at 12:45
- I agree. I fixed the code as you suggested. – asmaier Nov 21 '22 at 14:09
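As the comment thread notes, the http-equiv value appears in the wild as Refresh, refresh, or REFRESH. A minimal sketch of case-insensitive extraction using only the standard library's re module (a regex stand-in for the BeautifulSoup lookup above; `find_refresh_url` is a name of my choosing, and the pattern assumes http-equiv precedes content, which is typical):

```python
import re

# Case-insensitive pattern for <meta http-equiv="refresh" content="...">;
# matches Refresh, refresh, and REFRESH alike.
META_REFRESH_RE = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)

def find_refresh_url(markup):
    match = META_REFRESH_RE.search(markup)
    if not match:
        return None
    # content looks like "5; url=http://example.com/next"
    parts = match.group(1).split(";", 1)
    if len(parts) == 2 and parts[1].strip().lower().startswith("url="):
        return parts[1].strip()[4:]
    return None
```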
5
A similar solution using the requests and lxml libraries. It also does a simple check that the content being tested is actually HTML (a requirement in my implementation), and it can capture and reuse cookies via the requests library's sessions (sometimes necessary when redirection plus cookies are used as an anti-scraping mechanism).
import magic
import mimetypes
import requests
from lxml import html
from urlparse import urljoin

def test_for_meta_redirections(r):
    mime = magic.from_buffer(r.content, mime=True)
    extension = mimetypes.guess_extension(mime)
    if extension == '.html':
        html_tree = html.fromstring(r.text)
        attr = html_tree.xpath(
            "//meta[translate(@http-equiv, 'REFSH', 'refsh') = 'refresh']/@content")
        if attr:  # guard: the page may have no meta refresh tag at all
            wait, text = attr[0].split(";")
            text = text.strip()  # the part after ";" often has a leading space
            if text.lower().startswith("url="):
                url = text[4:]
                if not url.startswith('http'):
                    # Relative URL, adapt
                    url = urljoin(r.url, url)
                return True, url
    return False, None
def follow_redirections(r, s):
    """
    Recursive function that follows meta refresh redirections if they exist.
    """
    redirected, url = test_for_meta_redirections(r)
    if redirected:
        r = follow_redirections(s.get(url), s)
    return r
Usage:
s = requests.session()
r = s.get(url)
# test for and follow meta redirects
r = follow_redirections(r, s)

mlissner
- Sometimes meta refresh redirects point to relative URLs. For instance, Facebook does ``. It would be good to detect relative URLs and prepend the scheme and host. – Joe Mornin Jul 18 '13 at 05:31
- @JosephMornin: Adapted. I realized it still doesn't support circular redirects, though... always something. – mlissner Jul 18 '13 at 16:53
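The relative-URL handling discussed above comes down to urljoin's resolution rules; a quick illustration (shown with Python 3's urllib.parse spelling; Python 2 uses urlparse.urljoin as in the answer):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://example.com/a/b/page.html"

# Relative target: resolved against the directory of the base page.
print(urljoin(base, "next.html"))   # http://example.com/a/b/next.html

# Root-relative target: keeps only scheme and host from the base.
print(urljoin(base, "/top.html"))   # http://example.com/top.html

# Absolute target: used as-is, the base is ignored.
print(urljoin(base, "http://other.example/x"))  # http://other.example/x
```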
1
OK, it seems no library supports this, so I have been using the following code:
import urllib2
import urlparse
import re

def get_hops(url):
    redirect_re = re.compile('<meta[^>]*?url=(.*?)["\']', re.IGNORECASE)
    hops = []
    while url:
        if url in hops:
            url = None
        else:
            hops.insert(0, url)
            response = urllib2.urlopen(url)
            if response.geturl() != url:
                hops.insert(0, response.geturl())
            # check for redirect meta tag
            match = redirect_re.search(response.read())
            if match:
                url = urlparse.urljoin(url, match.groups()[0].strip())
            else:
                url = None
    return hops
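The loop protection in get_hops (stopping once a URL repeats) can be exercised without a network by injecting the fetch step; a sketch under that assumption (`get_hops_from` and its `fetch` callable are hypothetical names, and the urljoin step is omitted for brevity):

```python
import re

REDIRECT_RE = re.compile(r'<meta[^>]*?url=(.*?)["\']', re.IGNORECASE)

def get_hops_from(fetch, url):
    """Same hop-tracking logic as get_hops above, but fetch(url) is an
    injected callable returning (final_url, body), so no network is needed."""
    hops = []
    while url:
        if url in hops:           # circular redirect detected: stop
            url = None
        else:
            hops.insert(0, url)
            final_url, body = fetch(url)
            if final_url != url:  # record any HTTP-level redirect too
                hops.insert(0, final_url)
            match = REDIRECT_RE.search(body)
            url = match.group(1).strip() if match else None
    return hops

# Two pages that meta-refresh to each other: the guard stops the loop.
pages = {
    "http://a.test/": '<meta http-equiv="refresh" content="0;url=http://b.test/">',
    "http://b.test/": '<meta http-equiv="refresh" content="0;url=http://a.test/">',
}
hops = get_hops_from(lambda u: (u, pages[u]), "http://a.test/")
print(hops)  # most recent hop first: ['http://b.test/', 'http://a.test/']
```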

Onilton Maciel

hoju
1
If you don't want to use bs4, you can use lxml like this:
from lxml.html import soupparser

def meta_redirect(content):
    root = soupparser.fromstring(content)
    result_url = root.xpath('//meta[@http-equiv="refresh"]/@content')
    if not result_url:
        return None
    result_url = str(result_url[0])
    # the attribute value may spell the target as "url=" or "URL="
    urls = result_url.split('url=')
    if len(urls) < 2:
        urls = result_url.split('URL=')
    return urls[1] if len(urls) >= 2 else None

chroming
-1
Use BeautifulSoup or lxml to parse the HTML.

Ignacio Vazquez-Abrams
- Using an HTML parser just to extract the meta refresh tag is overkill, at least for my purposes. I was hoping there was a Python HTTP library that did this automatically. – hoju Feb 24 '10 at 01:05
- Well, `meta` *is* an HTML tag, so it is unlikely that you will find this functionality in an HTTP library. – Otto Allmendinger Feb 24 '10 at 12:14
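If even BeautifulSoup or lxml is too heavy a dependency, the standard library's HTML parser can do the same extraction; a minimal sketch (written against Python 3's html.parser module; `MetaRefreshParser` and `find_meta_refresh` are names of my choosing):

```python
from html.parser import HTMLParser  # Python 2: the HTMLParser module

class MetaRefreshParser(HTMLParser):
    """Collects the URL from the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh_url is not None:
            return
        d = dict(attrs)  # attribute names arrive lowercased
        if d.get("http-equiv", "").lower() == "refresh":
            parts = d.get("content", "").split(";", 1)
            if len(parts) == 2 and parts[1].strip().lower().startswith("url="):
                self.refresh_url = parts[1].strip()[4:]

def find_meta_refresh(html_text):
    parser = MetaRefreshParser()
    parser.feed(html_text)
    return parser.refresh_url
```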