Error in printing scraped webpage through bs4

Question

Code:

import requests
import urllib.request
from bs4 import BeautifulSoup

page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())

Error:

 Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
    print(soup.get_text())
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>

I think the problem lies mainly with the urllib package. Here I am using the urllib3 package. The urlopen syntax changed between versions 2 and 3, which may be the cause of the error. That said, I have used only the latest syntax. Python version: 3.4.

score 2 · Accepted Answer · edited May 23 '17 at 12:06

Since you are already importing requests, you can use it instead of urllib, like this:

import requests
from bs4 import BeautifulSoup

page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1.text)
print(soup.get_text())
print(soup.prettify())

Your problem is that Python cannot encode some of the characters from the page you are scraping when it prints them: the traceback shows the cp1252 codec failing on '\u014d'. For more information, see: https://stackoverflow.com/a/16347188/2638310
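To make the failure concrete, here is a minimal sketch that reproduces it without any network access. The helper name `can_encode` and the sample string are illustrative assumptions, not part of the original code; the only thing taken from the question is the character '\u014d' and the cp1252 codec named in the traceback.

```python
# The character named in the traceback: U+014D, LATIN SMALL LETTER O WITH MACRON.
sample = "T\u014dky\u014d"


def can_encode(text, encoding):
    """Return True if every character in `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 has no mapping for U+014D, which is why print() fails on a cp1252 console.
print("cp1252:", can_encode(sample, "cp1252"))
# UTF-8 can represent every Unicode character.
print("utf-8:", can_encode(sample, "utf-8"))
```

If this is the cause, one possible workaround is to make Python write UTF-8 to standard output, for example by setting the PYTHONIOENCODING environment variable to utf-8 before running the script. Whether the console then displays the characters correctly depends on the console itself.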

Since the Wikipedia page is in UTF-8, it seems that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument. Note that from_encoding only applies to raw bytes, so pass page1.content rather than the already-decoded page1.text:

soup = BeautifulSoup(page1.content, from_encoding="UTF-8")

For more on encodings in BeautifulSoup have a look here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
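For illustration, this is what a wrong encoding guess looks like at the byte level, independent of BeautifulSoup. It is a sketch using only the standard library; the sample string and the choice of latin-1 as the "wrong" guess are assumptions made for the example.

```python
original = "T\u014dky\u014d"

# Bytes as a UTF-8 page would send them over the wire.
raw = original.encode("utf-8")

# Decoding with the right encoding restores the text.
right = raw.decode("utf-8")

# Decoding with the wrong encoding does not fail here, but each byte of a
# multi-byte UTF-8 sequence becomes its own (wrong) character.
wrong = raw.decode("latin-1")

print(right == original)
print(ascii(wrong))
```

A decoding mistake like this produces garbled text rather than a UnicodeEncodeError, so if the traceback still ends in cp1252.py after fixing the input encoding, the output side (the console) is the more likely culprit.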

This gives the error: File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined> - Abhishek Bhatia, Jan 07 '15 at 11:28

Seppo · Answer 2 · 2015-01-07T12:41:49.550

0

I am using Python 2.7, so I don't have the request submodule inside the urllib module.

#!/usr/bin/python3
# coding: utf-8

import requests
from bs4 import BeautifulSoup

URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())

https://www.python.org/dev/peps/pep-0263/

edited Jan 07 '15 at 12:41

answered Jan 07 '15 at 12:18

Seppo


score 0 · Answer 3 · edited Nov 23 '18 at 00:29

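Put those print lines inside a try/except block, so that if a character cannot be encoded for the console you fall back to printing the UTF-8 bytes instead of getting an error.

try:
    print(soup.get_text())
    print(soup.prettify())
except UnicodeEncodeError:
    # The console encoding cannot represent some character, so print the
    # UTF-8 bytes instead (shown as a b'...' literal with escape sequences).
    print(soup.get_text().encode("utf-8"))
    print(soup.prettify().encode("utf-8"))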
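As an alternative to falling back to a bytes literal, the text can be kept readable by escaping only the characters the output stream cannot represent. The sketch below assumes a hypothetical helper named `safe_print`; it relies only on the standard `backslashreplace` error handler, and the in-memory cp1252 stream is there just to simulate the Windows console from the question.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print `text`, escaping characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the target encoding: unencodable characters are
    # turned into backslash escapes such as \u014d, everything else is kept.
    cleaned = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(cleaned, file=stream)


# Simulate a cp1252 console with an in-memory stream.
buffer = io.BytesIO()
fake_console = io.TextIOWrapper(buffer, encoding="cp1252")
safe_print("T\u014dky\u014d", stream=fake_console)
fake_console.flush()
written = buffer.getvalue()
```

With this approach, `safe_print(soup.get_text())` would print the page text and show any unencodable character as an escape sequence instead of raising.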

Eric Reed


answered Jan 08 '15 at 10:34
