Python 3 Beautiful Soup Web Scraping

Question

I'm currently working with BeautifulSoup. I seem to be having some issues related to encoding.

Here is my code:

import requests
from bs4 import BeautifulSoup
req = requests.get('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(req.content.decode('utf-8','ignore'))
print(soup.find_all('p'))

Here is my error:

 UnicodeEncodeError: 'ascii' codec can't encode character '\u1d90' in position 602: ordinal not in range(128)

Any help would be appreciated.

I'm sorry the link you just sent me is the link to this post. — Daniel Smith, Apr 24 '17 at 18:36
I can't reproduce any issue with your code in either Python 2 or 3. Anyway, I suggest replacing `req.content.decode('utf-8','ignore')` with `req.text`. — Alex Hall, Apr 24 '17 at 18:46
That was one of the solutions I tried. I am able to print req.content with no problem. However, when I print soup.text I get the error you see above. So I can make the request, but once I start working with BeautifulSoup objects I have these encoding issues. Any idea? — Daniel Smith, Apr 24 '17 at 18:50
@Alex Hall, I just tried your suggestion. Printing req.content works. However, I get the same UnicodeEncodeError when I print req.text as you suggested. Any ideas? — Daniel Smith, Apr 24 '17 at 18:52
You should set Python's default encoding to utf-8, as explained at http://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file/4546129#4546129 — Jakub Macina, Apr 24 '17 at 18:57
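
The error in the question can be reproduced without fetching the page. The sketch below assumes the failure happens when the text is encoded for an ASCII-only output stream, which is what the 'ascii' codec in the traceback suggests. The sample string is the paragraph from the page that contains U+1D90; the variable names are illustrative only.

```python
# U+1D90 is the character named in the traceback above.
text = "Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf"

try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # ascii() keeps this message printable even on an ASCII-only console.
    print("ascii cannot encode:", ascii(exc.object[exc.start:exc.end]))

# UTF-8 can represent every character, so this encoding step succeeds.
utf8_bytes = text.encode("utf-8")

# errors="backslashreplace" keeps the output ASCII-safe without dropping data.
escaped = text.encode("ascii", errors="backslashreplace").decode("ascii")
print(escaped)
```

If this sketch fails in the same way on a given setup, the parser is probably not the cause; the encoding of the output stream is the more likely suspect.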

Dariusz · Answer 1 · 2017-04-24T19:24:52.793

0

Please pass a parser name, "html5lib" or "html.parser", as the second argument to BeautifulSoup:

#!/usr/bin/python
# -*- coding: utf-8 -*-

...

# Python 3.6.0
soup = BeautifulSoup(req.content.decode('utf-8','ignore'), "html5lib")

# Python 2.7.12
soup = BeautifulSoup(req.content.decode('utf-8','ignore'), "html.parser")

edited Apr 24 '17 at 19:24

answered Apr 24 '17 at 18:48

Dariusz


Thanks for the suggestion. I tried it out but it didn't work. Same error. – Daniel Smith Apr 24 '17 at 18:57
Can you give me the result of the `pip freeze` command? Python and OS version? – Dariusz Apr 24 '17 at 19:01
Python 3.6.0. OS X Yosemite 10.10.5. pyperclip==1.5.27 PyScreeze==0.1.9 PyTweening==1.0.3 pytz==2016.10 requests==2.12.5 Send2Trash==1.3.0 six==1.10.0 virtualenv==15.1.0 webencodings==0.5.1 – Daniel Smith Apr 24 '17 at 19:06
Do you think maybe webencodings has something to do with it? – Daniel Smith Apr 24 '17 at 19:08
My mistake ;) I have pip and python set to python3 by default. You gave me your packages from python2. You need to run `pip3 freeze` – Dariusz Apr 24 '17 at 19:13
So I just came across something interesting. Maybe this will assist in the troubleshooting. Was using sublime originally when I had this issue. I just swapped over to the python interpreter/shell and I'm not having any issue. Any reason for this discrepancy ? – Daniel Smith Apr 24 '17 at 19:16
Add `# -*- coding: utf-8 -*-` as the first line of your script in Sublime, or as the second line if the first line specifies the interpreter: `#!/usr/bin/python` – Dariusz Apr 24 '17 at 19:18
So I think this is definitely an issue with sublime. I have sublime and sublime2 . It works partially in sublime 2 and in sublime I get the error. Very odd – Daniel Smith Apr 24 '17 at 19:44
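
The difference between the editor and the interactive shell is consistent with print() using whatever encoding the process was started with. Here is a minimal sketch, assuming the editor's build environment reports an ASCII-based stdout encoding. The helper name `utf8_writer` is hypothetical, and the demonstration writes to an in-memory buffer so that it runs anywhere.

```python
import io
import sys

# The encoding print() uses depends on how the interpreter was launched;
# an editor build system may report an ASCII-based encoding here.
print("stdout encoding:", sys.stdout.encoding)


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Demonstrated on an in-memory buffer; in a script, sys.stdout.buffer
# could be passed instead (an assumption about where the output should go).
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf", file=out)
out.flush()
written = buffer.getvalue()
```

Another option is to set the PYTHONIOENCODING=utf-8 environment variable for the process that runs the script. This changes the stdout encoding without touching the code.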

score 0 · Answer 2 · answered Apr 24 '17 at 18:50

I tried to reproduce the issue that you are facing here but was not able to.

Here is what I tried.

>>> import requests
>>> from bs4 import BeautifulSoup

>>> req = requests.get('https://pythonprogramming.net/parsememcparseface/')

>>> soup = BeautifulSoup(req.content.decode('utf-8','ignore'))


Warning (from warnings module):
  File "C:\Python34\lib\site-packages\bs4\__init__.py", line 166
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

>>> soup = BeautifulSoup(req.content.decode('utf-8','ignore'), 'html.parser')
>>> print(soup.find_all('p'))
[<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>, <p>The following table gives some general information for the following <code>programming languages</code>:</p>, <p>I think it's clear that, on a scale of 1-10, python is:</p>, <p>Javascript (dynamic data) test:</p>, <p class="jstest" id="yesnojs">y u bad tho?</p>, <p>Whᶐt hαppéns now¿</p>, <p><a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a></p>, <p>
<a class="btn btn-flat white modal-close" href="#">Cancel</a>  
                        <a class="waves-effect waves-blue blue btn btn-flat modal-action modal-close" href="#">Login</a>
</p>, <p>
<a class="btn btn-flat white modal-close" href="#">Cancel</a>  
                                <button class="btn" type="submit" value="Register">Sign Up</button>
</p>, <p class="grey-text text-lighten-4">Contact: Harrison@pythonprogramming.net.</p>, <p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>]

Thanks for trying. I'm just super confused as to what the issue could be. No one else seems to be having this issue when they try the code. — Daniel Smith, Apr 24 '17 at 18:56

score 0 · Accepted Answer · answered Apr 24 '17 at 20:55

I can duplicate your error message and eliminate troublesome characters.

First, this code simply requests the page and attempts to save it. The attempt fails with the message you have seen. I then create a copy of the page by converting it to bytes, ignoring ugly character codes, and then converting it back to characters. Now the page can be saved successfully.

I make soup with it and find the paragraph tags.

>>> from bs4 import BeautifulSoup
>>> import requests
>>> req = requests.get('https://pythonprogramming.net/parsememcparseface/').text
>>> open('c:/scratch/temp.htm', 'w').write(req)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u1d90' in position 6702: character maps to <undefined>
>>> modReq = str(req.encode('utf-8', 'ignore'))
>>> open('c:/scratch/temp.htm', 'w').write(modReq)
12556
>>> soup = BeautifulSoup(modReq, 'lxml')
>>> paras = soup.findAll('p')
>>> len(paras)
12
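
An alternative that keeps every character is to name the file encoding explicitly instead of relying on the platform default (cp1252 in the traceback above). Note that in Python 3, calling str() on a bytes object returns its printable representation, beginning with b', rather than the decoded text. The sketch below uses a stand-in string and a hypothetical temporary path instead of the real page and c:/scratch.

```python
import os
import tempfile

# Stand-in for the page text; the real code would use requests.get(...).text.
page_text = "<p>Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf</p>"

# Hypothetical output location chosen for this sketch.
path = os.path.join(tempfile.mkdtemp(), "temp.htm")

# Naming the encoding avoids falling back to the platform default codec.
with open(path, "w", encoding="utf-8") as fh:
    chars_written = fh.write(page_text)

# Reading back with the same encoding returns the original characters.
with open(path, encoding="utf-8") as fh:
    round_tripped = fh.read()

# str() on bytes gives the repr, not the decoded text.
repr_text = str("\u1d90".encode("utf-8"))
```

With the file written this way, the saved copy still contains the non-ASCII paragraph, and the original text can be passed to BeautifulSoup unchanged.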

Thank you so much. I appreciate the help. – Daniel Smith Apr 25 '17 at 00:03
