Issues with encoding in Python 3 and urllib3

Question

I'm trying to write a Python program that automatically collects news from different websites. At the moment I'm using Python 3 with beautifulsoup4 and urllib3 to fetch the remote page and parse it.

The problem arises when I try to read text from these pages, because they contain non-ASCII characters such as À à é ó and so on.

I've tried decoding the page as UTF-8 right after retrieving it, storing it in a variable, and then writing it to a file, without success. Even after reading about different ways to approach this problem, I couldn't find a working solution.

I was wondering whether any of you has been in the same situation.

Here is my code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
req = http.request('GET', 'http://www.....')
page = req.data.decode('utf-8')
soup = BeautifulSoup(page)

elements = soup.find_all('div', class_='content')

fp = open('results.xml', 'a')

for element in elements:
  link  = element.find('a')
  descr = element.find('div', class_='description')

  v_link  = u'%s' % link.get('href')
  v_description = u'%s' % descr.text

  xml = "<news>\n"
  xml = xml+ "  <description>"+ v_description+ "</description>\n"
  xml = xml+ "  <page_link>"+ v_link+ "</page_link>\n"
  xml = xml+ "</news>\n"

  fp.write(xml+ '\n')

#END FOR LOOP

fp.close()

"the problem comes out" What is the problem precisely? Is there an error? What is it? On which line? — shazow, Nov 09 '14 at 23:04

score 0 · Answer 1 · answered Nov 09 '14 at 23:05

Just encode your string and write it to the file. In Python 3 the file has to be opened in binary mode to accept the encoded bytes, something like this:

desc = 'À à é ó...and so on...'.encode('utf-8')
with open('utf8.xml', 'ab') as f:
    f.write(desc)

cat utf8.xml
À à é ó...and so on...

So, in your case, perhaps you need to open the file in binary mode (fp = open('results.xml', 'ab')) and change:

fp.write(xml+ '\n')

to this:

fp.write((xml + '\n').encode('utf-8'))
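In Python 3, a file opened in text mode expects str rather than bytes, so an alternative to encoding by hand is to pass an explicit encoding to open() and write the text directly. A minimal sketch of that idea; the temporary directory is only an assumption to keep the demo self-contained, and the file name follows the example above:

```python
import os
import tempfile

text = "À à é ó...and so on..."

# Hypothetical location for the demo; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "utf8.xml")

# Text mode with an explicit encoding: the file object turns str into
# UTF-8 bytes, regardless of the platform's default encoding.
with open(path, "a", encoding="utf-8") as f:
    f.write(text + "\n")

# Read the raw bytes back to check what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw.decode("utf-8"), end="")
```

Without the encoding argument, open() falls back to a locale-dependent default, which is one common way for non-ASCII characters to fail on write.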

score 0 · Answer 2 · edited May 23 '17 at 10:26

Without examples, it's hard to say. It sounds like you're decoding non-UTF-8 text (perhaps it's ISO-8859-1), or that BeautifulSoup is re-decoding it based on the document's metadata (or guesswork).
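One way to check the first possibility is to decode with the charset the server declares instead of assuming UTF-8. A rough sketch using only the standard library; decode_body is a hypothetical helper, and the ISO-8859-1 fallback is a guess rather than something the question confirms:

```python
from email.message import Message


def decode_body(raw: bytes, content_type: str, fallback: str = "iso-8859-1") -> str:
    """Decode response bytes using the charset named in a Content-Type header.

    Hypothetical helper: tries the declared charset, then UTF-8, then
    the fallback (ISO-8859-1 by default, which accepts any byte sequence).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # None when no charset is given

    for candidate in (declared, "utf-8"):
        if candidate is None:
            continue
        try:
            return raw.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    return raw.decode(fallback)
```

With urllib3 this might be called as decode_body(req.data, req.headers.get('Content-Type', '')). BeautifulSoup can also be given the raw bytes (req.data) and left to detect the encoding itself, which avoids decoding the page twice.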

A few unrelated tips for that code:

Be careful writing XML using plain strings. You should be escaping it at the very least (if v_description or v_link contains a >, <, & etc. you'll be creating invalid XML). Better still, build the XML programmatically (see: Best way to generate xml?)
In Python 3 you can use the with statement to ensure your file is closed automatically.
Don't use + to construct strings in Python; use templating, e.g. str.format or string.Formatter. It's more readable, and it can be faster when many pieces are combined.
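The first two tips can be combined in a short sketch using the standard library's xml.etree.ElementTree, which escapes special characters for you. The element names come from the question; build_news and append_news are hypothetical helpers, not part of any library:

```python
import xml.etree.ElementTree as ET


def build_news(description: str, page_link: str) -> ET.Element:
    """Build one <news> element; ElementTree escapes &, < and > in text."""
    news = ET.Element("news")
    ET.SubElement(news, "description").text = description
    ET.SubElement(news, "page_link").text = page_link
    return news


def append_news(path: str, items) -> None:
    """Append one serialized <news> element per line to the file at path.

    items is an iterable of (description, page_link) pairs.
    """
    # The with statement closes the file even if an exception is raised;
    # encoding="utf-8" makes the file object encode the text on write.
    with open(path, "a", encoding="utf-8") as fp:
        for description, page_link in items:
            element = build_news(description, page_link)
            # encoding="unicode" makes tostring() return str, not bytes.
            fp.write(ET.tostring(element, encoding="unicode"))
            fp.write("\n")
```

As in the question's code, appending elements this way produces a sequence of fragments rather than a single well-formed document with one root element, so a reader of the file would need to wrap or parse them individually.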
