It is not possible to parse a part of a webpage that is visible when opened in a browser

Question

I have a strange problem parsing the Herald Sun webpage to get its list of RSS feeds. When I look at the page in the browser, I can see the links with their titles. However, when I use Python and Beautiful Soup to parse the page, the response does not even contain the section I want to parse.

import urllib.request
import urllib.error

hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9) AppleWebKit/537.71 (KHTML, like Gecko) Version/7.0 Safari/537.71',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}

req = urllib.request.Request("http://www.heraldsun.com.au/help/rss", headers=hdr)

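try:
    page = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # An HTTPError still carries the body the server sent back.
    html_doc = e.read()
    print(html_doc)
else:
    html_doc = page.read()

# Write the raw response to disk and close the file when done.
with open("Temp/original.html", 'w', encoding='utf-8') as f:
    f.write(html_doc.decode('utf-8'))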

The written file, as you can check, does not contain the results, so Beautiful Soup has nothing to work with.

How does the webpage implement this protection, and how can I get around it? Thanks.

score 1 · Accepted Answer · answered Dec 05 '13 at 03:55

For commercial use, read the terms of service first.

There is really not much information the server knows about who is making the request: essentially the IP address, the User-Agent and the cookies. Also, urllib2 does not run JavaScript, so it will not retrieve content that is generated by JavaScript.

JavaScript or Not?

(1) Open the Chrome developer tools, disable the cache and JavaScript, and reload the page to check whether you can still see the information you want. If you cannot, you have to use a tool that supports JavaScript, such as Selenium or PhantomJS.

However, in this case, the website does not look that sophisticated.
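The same check can be scripted against a saved response instead of the browser. The sketch below uses only the standard library and assumes the feed links are ordinary `<a href>` tags; `LinkCollector`, `links_in_static_html` and the two sample strings are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_in_static_html(html, marker):
    """Return the links containing `marker` that are present in the raw HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if marker in h]


# Hypothetical sample responses, for illustration only.
FEED = "http://feeds.news.com.au/heraldsun/rss/heraldsun_news_breakingnews_2800.xml"
static_page = '<div id="content-2"><a href="%s">Breaking News</a></div>' % FEED
script_page = '<div id="content-2"></div><script>loadFeeds()</script>'

print(links_in_static_html(static_page, "feeds.news"))
print(links_in_static_html(script_page, "feeds.news"))
```

If the list is empty for the response you saved but the links show up in the browser, the content is either generated by JavaScript or the server answered your script differently than it answered the browser.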

User-Agent? Cookie?

(2) Then the problem comes down to tuning the User-Agent or the cookies. As you have already tried, the User-Agent alone does not seem to be enough, so it is the cookie that will do the trick.

As you can see, the first page request actually returns "temporarily unavailable", and you need to click the rss HTML request that has the 200 return code. Copy the User-Agent and cookies from that request and it will work.

Here is the code showing how to add a cookie using urllib2:

import urllib2, bs4

opener = urllib2.build_opener()
# Send the same User-Agent as the browser.
opener.addheaders = [("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36")]
# I omitted the cookie here and you need to copy and paste your own
opener.addheaders.append(('Cookie', 'act-bg-i...eat_uuniq=1; criteo=; pl=true'))
soup = bs4.BeautifulSoup(opener.open("http://www.heraldsun.com.au/help/rss"))
# The feed links sit inside the "group-content" div of the "content-2" block.
div = soup.find('div', {"id":"content-2"}).find('div', {"class":"group-content"})

for a in div.find_all('a'):
    # Some anchors have no href, so fall back to an empty string.
    if 'feeds.news' in a.get('href', ''):
        print a

And here is the output:

<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_breakingnews_2800.xml">Breaking News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_topstories_2803.xml">Top Stories</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_worldnews_2793.xml">World News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_morenews_2794.xml">Victoria and National News</a>
<a href="http://feeds.news.com.au/heraldsun/rss/heraldsun_news_sport_2789.xml">Sport News</a>
...
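Since the question uses Python 3's urllib.request rather than urllib2, the same idea might look like the sketch below. `build_request` is a hypothetical helper, and the User-Agent and cookie strings are placeholders, not values from the site; copy the real ones from your own browser session.

```python
import urllib.request

URL = "http://www.heraldsun.com.au/help/rss"


def build_request(url, user_agent, cookie):
    """Build a request that carries the browser's User-Agent and Cookie."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Cookie": cookie},
    )


# Placeholder values: copy the real ones from the browser's network panel.
req = build_request(URL, "Mozilla/5.0 (placeholder)", "name=value")

# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Cookie"))

# Sending the request needs network access, so it is left commented out:
# with urllib.request.urlopen(req) as page:
#     html_doc = page.read().decode("utf-8")
```

The resulting `html_doc` can then be handed to Beautiful Soup in the same way as in the urllib2 version.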

score 0 · Answer 2 · edited May 23 '17 at 10:25

The site could very likely be serving different content, depending on the User-Agent string in the headers. Websites will often do this for mobile browsers, for example.

Since you're not specifying one, urllib is going to use its default:

By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number.

You could try spoofing a common User-Agent string, by following the advice in this question. See What's My User Agent?
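To see what is sent when no User-Agent is specified, you can inspect a fresh opener. This is a small sketch for Python 3's urllib.request, where the default string starts with "Python-urllib/" rather than the "urllib/VVV" form quoted above; the replacement User-Agent is a placeholder.

```python
import urllib.request

# A fresh opener carries only the library's default User-Agent header.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)

# Replace the default with a browser-like string (placeholder value).
opener.addheaders = [("User-Agent", "Mozilla/5.0 (placeholder)")]
print(opener.addheaders)
```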

answered Dec 05 '13 at 00:18

Jonathon Reinhart

In fact, I did use the User-Agent. Let me edit the question; I removed that part when I copied the code. – Hoang Pham Dec 05 '13 at 00:20
Sigh... and that's a good reason to post your *actual* code. – Jonathon Reinhart Dec 05 '13 at 00:21
I updated the code; it still has that error. I don't know whether the other elements have any effect. – Hoang Pham Dec 05 '13 at 00:27
