
Although the script I have written works, not all sites have their titles returned (that is what I'm after: getting the website's title and printing it back). Sites like Google work, but others, such as this very site, Stack Overflow, generate an error.

Here is my code:

    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://lxml.de"))
    print soup.title.string

If you could help with these things, that would be great :)

  1. Whether any improvements could be made to the code (including how it handles variables).
  2. How to solve the issue of the title not being returned (and how to handle errors in general).
  3. When the code does work, it actually emits a UserWarning saying that I should add "html.parser" to the BeautifulSoup call, but it didn't work after I put that in.

BTW, ERROR GIVEN (exactly as it spat it out):

    Traceback (most recent call last):
      File "C:\Users\NAME\Desktop\NETWORK\personal work\PROGRAMMING\Python\bibliography PYTHON\TEMP.py", line 5, in <module>
        soup = BeautifulSoup(urllib2.urlopen("http://stackoverflow.com/questions/36496222/beautiful-soup-4-not-working-consistent"))
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 154, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 437, in open
        response = meth(req, response)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 550, in http_response
        'http', request, response, code, msg, hdrs)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 475, in error
        return self._call_chain(*args)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 409, in _call_chain
        result = func(*args)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 558, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden
    Press any key to continue . . .
John Hon

2 Answers


I can get this to work by specifying the user agent header. I have a feeling it has something to do with https vs http, but I'm afraid I'm not entirely sure what the reason is.

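    import urllib2
    from bs4 import BeautifulSoup

    site = "https://stackoverflow.com"
    # Browser-like User-Agent; the server may return 403 without it.
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

    req = urllib2.Request(site, headers=hdr)

    try:
        soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")
    except urllib2.HTTPError as e:
        # Print the body of the error response (e.g. the 403 page).
        print e.fp.read()
    else:
        # Only reached when the request succeeded, so soup is defined here.
        print soup.title.string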

This was influenced by this answer on another question.

chrxr
  • Thanks, worked great! Just one thing: what does the "hdr" variable do? I don't quite get it. For other readers: add "html.parser" to remove the UserWarning. Even though it didn't work before, it does now: soup = BeautifulSoup(urllib2.urlopen(req), "html.parser") – John Hon Apr 08 '16 at 12:57
  • No worries. I've added the "html.parser" bit to the answer. The "hdr" is a dictionary of HTTP headers to send with the urllib request. Depending on the target server configuration, the server may return a 403 if certain headers are not present. – chrxr Apr 08 '16 at 13:54
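To make the role of "hdr" a bit more concrete, here is a minimal sketch of what the headers dictionary changes. It is written for Python 3, where urllib2 became urllib.request; the helper names `default_user_agent` and `build_request` are made up for illustration. It only builds the request and inspects it, so nothing is sent over the network. When no User-Agent is supplied, urllib identifies itself as "Python-urllib/x.y", which is presumably what some servers reject with a 403.

```python
import urllib.request

# Values taken from the answer above; the constant names are illustrative.
URL = "https://stackoverflow.com"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 "
                  "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11"
}


def default_user_agent():
    """Return the User-Agent urllib sends when none is supplied."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    return dict(opener.addheaders).get("User-agent")


def build_request(url, headers):
    """Attach the given headers to a request without sending it."""
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request(URL, HEADERS)
    print("default:", default_user_agent())
    # Request stores header names capitalized as "User-agent".
    print("custom: ", req.get_header("User-agent"))
```

Passing the resulting request to `urllib.request.urlopen(req)` would then send the browser-like header instead of the default one.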

Try the requests library:

    pip install requests

The code below works for me:

    import requests

    htmlresponse = requests.get("http://lxml.de/")
    # Prints the raw HTML of the whole page.
    print htmlresponse.content
jithin
  • When I tried yours, it simply spat out ALL the HTML code on the page. I simply want the title; do you know of a way to do this? :) – John Hon Apr 08 '16 at 11:07
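One possible way to get just the title with requests is to hand the downloaded HTML to BeautifulSoup, as in the first answer. The following is a rough sketch for Python 3, not a tested solution for every site; the function names `title_from_html` and `fetch_title` and the 10-second timeout are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def title_from_html(html):
    """Return the text of the <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url):
    """Download a page with requests and return its title."""
    # Imported here so title_from_html can be used without requests installed.
    import requests

    response = requests.get(url, timeout=10)
    # Raises requests.HTTPError for 4xx/5xx responses such as 403.
    response.raise_for_status()
    return title_from_html(response.content)


if __name__ == "__main__":
    # Offline demonstration on a literal HTML string.
    print(title_from_html("<html><head><title> Example </title></head></html>"))
```

Calling `fetch_title("http://lxml.de/")` should then return only the page title instead of the full HTML. Sites that answer 403 may still need a User-Agent header, which requests accepts through its `headers=` argument.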