I'm trying to write a Python program that automatically gets some news from different websites. At the moment I'm using Python 3 with beautifulsoup4 and urllib3 to fetch the remote pages and parse them.
The problem arises when I try to read text from these pages, because they contain non-ASCII characters such as À, à, é, ó, and so on.
I've tried decoding the page from UTF-8 right after retrieving it, storing the result in a variable, and then writing it to a file, without success. Even after reading about different ways to approach this problem, I couldn't figure out a working solution.
Has anyone of you been in the same situation?
Here is my code:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib3
http = urllib3.PoolManager()
req = http.request('GET', 'http://www.....')
page = req.data.decode('utf-8')
soup = BeautifulSoup(page, 'html.parser')
elements = soup.find_all('div', class_='content')
fp = open('results.xml', 'a')
for element in elements:
    link = element.find('a')
    descr = element.find('div', class_='description')
    v_link = u'%s' % link.get('href')
    v_description = u'%s' % descr.text
    xml = "<news>\n"
    xml = xml + "  <description>" + v_description + "</description>\n"
    xml = xml + "  <page_link>" + v_link + "</page_link>\n"
    xml = xml + "</news>\n"
    fp.write(xml + '\n')
fp.close()
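To narrow things down, here is a stripped-down sketch of just the decode-and-write step in isolation, with no network or parsing involved. The byte string is a made-up stand-in for `req.data`, and `out.xml` is a stand-in for my results file. When I pass an explicit `encoding` to `open()` this works, which makes me suspect the default locale encoding of the file is the part that trips me up:

```python
# Stand-in for req.data: raw UTF-8 bytes containing accented characters.
raw = "<div>Qualità è già qui</div>".encode("utf-8")

# Decoding works fine: 'text' becomes a proper Python 3 str.
text = raw.decode("utf-8")

# Writing is where it can go wrong: without encoding=..., open() uses the
# locale's default encoding, which may not be able to represent À, é, ó
# on every system. Passing it explicitly avoids that.
with open("out.xml", "a", encoding="utf-8") as fp:
    fp.write("<news>\n  <description>%s</description>\n</news>\n" % text)
```

Is this the right way to handle it, or should the decoding step itself be done differently?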