Beautiful Soup 4 Not Working/Consistent

Question

Although the script that I have written works, not all sites have their titles returned (that is what I'm going after: to get the website's title and print it back). Sites like Google work, but others, such as this very site, Stack Overflow, generate an error.

Here is my code:

    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://lxml.de"))
    print soup.title.string

If you could do these things for me that would be great :)

If any improvements could be made to the code (and how it handles variables)
How to solve the issue that it doesn't return the title for some sites (and handle any errors in general)
The code actually returns a UserWarning (when it actually works) saying that I should add a special "html.parser" after the script, but it didn't work after I put that in.

BTW, ERROR GIVEN (exactly as it spat it out):

Traceback (most recent call last):
  File "C:\Users\NAME\Desktop\NETWORK\personal work\PROGRAMMING\Python\bibliography PYTHON\TEMP.py", line 5, in <module>
    soup = BeautifulSoup(urllib2.urlopen("http://stackoverflow.com/questions/36496222/beautiful-soup-4-not-working-consistent"))
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Press any key to continue . . .

The error seems to be related to the urllib library you are using. – jithin Apr 08 '16 at 09:48

score 1 · Accepted Answer · edited May 23 '17 at 11:51

I can get this to work by specifying the user agent header. I have a feeling it has something to do with https vs http, but I'm afraid I'm not entirely sure what the reason is.

    import urllib2
    from bs4 import BeautifulSoup

    site = "https://stackoverflow.com"
    # Browser-like User-Agent header to send with the request
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

    req = urllib2.Request(site, headers=hdr)

    try:
        soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")
    except urllib2.HTTPError as e:
        # Show the error page the server sent back
        print e.fp.read()
    else:
        # Only reached when the request succeeded, so soup is defined here
        print soup.title.string

This was influenced by this answer on another question.
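
For the "handle any errors in general" part of the question, one possible shape is a small helper that returns the title or None instead of raising. This is only a sketch: the name get_title, the shortened User-Agent value and the timeout are illustrative choices, not something taken from the code above, and the imports fall back to urllib2 so that it should run on Python 2.7 as well as Python 3.

```python
try:  # Python 3
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError
except ImportError:  # Python 2.7, as used in the question
    from urllib2 import Request, urlopen, HTTPError, URLError

from bs4 import BeautifulSoup

# Illustrative browser-like value; the longer string from the answer works too.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def get_title(url, timeout=10):
    """Return the page title as text, or None if it cannot be retrieved."""
    request = Request(url, headers=HEADERS)
    try:
        response = urlopen(request, timeout=timeout)
    except HTTPError as e:
        # The server answered, but with an error status such as 403.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error %s for %s" % (e.code, url))
        return None
    except URLError as e:
        # No usable answer at all (DNS failure, refused connection, ...).
        print("Could not open %s: %s" % (url, e.reason))
        return None
    try:
        # Naming the parser explicitly avoids Beautiful Soup's UserWarning.
        soup = BeautifulSoup(response.read(), "html.parser")
    finally:
        response.close()
    # soup.title is None when the page has no <title> element.
    if soup.title is None:
        return None
    return soup.title.string
```

Calling get_title("https://stackoverflow.com") would then either give back the title or print a short message and return None, so the caller can decide what to do next.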

edited May 23 '17 at 11:51 by Community · answered Apr 08 '16 at 09:54 by chrxr

Thanks, worked great! Just one thing: what does the "hdr" variable do? I don't quite get it. For other readers: add "html.parser" to remove the UserWarning. Even though it didn't work before, it does now, I don't know why: soup = BeautifulSoup(urllib2.urlopen(req), "html.parser") – John Hon Apr 08 '16 at 12:57
No worries. I've added the "html.parser" bit to the answer. The "hdr" is a dictionary of HTTP headers to send with the urllib request. Depending on the target server configuration, the server may return a 403 if certain headers are not present. – chrxr Apr 08 '16 at 13:54
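
To make the role of "hdr" a bit more concrete: unless told otherwise, urllib identifies itself with a default User-Agent of the form Python-urllib/<version>, and some servers appear to reject that. The sketch below only inspects headers locally and sends no request; the shortened User-Agent value is an illustrative stand-in for the one in the answer, and the imports fall back to urllib2 on Python 2.7.

```python
try:  # Python 3
    from urllib.request import Request, build_opener
except ImportError:  # Python 2.7, as used in the question
    from urllib2 import Request, build_opener

# Headers urllib adds by default when the request does not set them.
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])  # Python-urllib/<version>

# The same kind of dictionary as "hdr" in the answer (value shortened here).
hdr = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
req = Request("https://stackoverflow.com", headers=hdr)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

A header set on the Request takes precedence over the opener's default, which is why passing hdr changes what the server sees.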

score 0 · Answer 2 · answered Apr 08 '16 at 09:51

Try the requests library instead:

    pip install requests

The code below works for me:

    import requests
    from bs4 import BeautifulSoup

    htmlresponse = requests.get("http://lxml.de/")
    # Prints the raw HTML of the whole page
    print htmlresponse.content

answered Apr 08 '16 at 09:51 by jithin

When I tried yours, it simply spat out ALL the HTML code on the page. I simply want the title; do you know of a way to do this? :) – John Hon Apr 08 '16 at 11:07
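
Following up on the comment above: one way to get only the title with requests is to hand the downloaded HTML to Beautiful Soup, much as the question does with urllib2. A minimal sketch, assuming requests and bs4 are both installed; the helper names extract_title and fetch_title and the timeout value are made up for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the <title> element, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title is not None else None


def fetch_title(url):
    """Download a page with requests and return its title."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return extract_title(response.content)


# Example (needs network access):
# print(fetch_title("http://lxml.de/"))
```

Keeping the parsing in its own function means it can be tried on a plain HTML string without downloading anything.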
