
I am trying to get the HTML content of several pages with Python 2.7.3 and urllib2. For most pages it works fine, but some, like http://www.bbc.co.uk/news/entertainment-arts-22441507#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa, return this content instead:

This page is best viewed in an up-to-date web browser with style sheets (CSS) enabled. While you will be able to view the content of this page in your current browser, you will not be able to get the full visual experience. Please consider upgrading your browser software or enabling style sheets (CSS) if you are able to do so.

This problem also occurs on pages where JavaScript is required: I only get back the content inside the `noscript` tag.

Here is how I get the content:

import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# The opener must actually be used; urllib2.urlopen() ignores it.
response = opener.open(url).read().decode("utf-8")

Are there additional headers needed?

Martin Golpashin
  • Looks like User-Agent detection to me. You could try adding a forged User-Agent header that imitates a browser, e.g. `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:20.0) Gecko/20100101 Firefox/20.0` (see the sketch after these comments). – Xion May 11 '13 at 13:27
  • Any reason not to use the much friendlier `requests` library? – SpankMe May 11 '13 at 14:04
  • I tried it with the `requests` library and the header from @Xion. Still no success; maybe I am doing something wrong. – Martin Golpashin May 11 '13 at 14:15

1 Answer


Sounds like you're fetching the original HTML page, before JavaScript/AJAX has had a go at it. Try using WebKit to render the page with JavaScript applied. See here for an answer with links.

alexis