I am writing a program that collects data (title, author, article text) from a web page containing a news article. I use the Readability Python library. My problem is that when the article is written in Cyrillic, the content the program returns has the format below (articles written in Latin script come out fine):
{'atricle': u'<div><div class="b-text clearfix">\n<p class="b- topic__announce">'С';'о';'р';'о';'к'; 'о';'д';'и';'н'; 'п';'р';'о';'ц';'е';'н';'т'; 'р';'о';'с';'с';'и';'я';'н';'C'....
(without the ' quotes)
How can I make it readable?

user3363858
-
Can you use third-party tools? If yes, please check out Beautiful Soup. – sundar nataraj Apr 17 '14 at 05:46
-
Is it Cyrillic as in Unicode, or Cyrillic as in KOI-8/CP1251/etc.? – user58697 Apr 17 '14 at 07:29
-
I have used this in the past to detect encoding and convert properly to the encoding I expected. http://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii#6988354 – Juanmi Taboada Apr 17 '14 at 07:53
-
http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python – alex vasi Apr 17 '14 at 08:42
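
A minimal sketch of the approach the last comment points to, under one loud assumption: that the unreadable Cyrillic text is made of HTML numeric character references such as `&#1057;` (the question only shows them already rendered, so this is a guess). The sketch targets Python 3 and uses the standard-library `html.unescape`; the function name `decode_entities` and the sample string are hypothetical, not part of Readability.

```python
import html


def decode_entities(markup):
    """Replace HTML character references (e.g. '&#1057;') with the characters they encode."""
    return html.unescape(markup)


# Hypothetical sample (an assumption, not taken from the real page):
# one Cyrillic word written as numeric character references.
sample = "<p>&#1057;&#1086;&#1088;&#1086;&#1082;</p>"

print(decode_entities(sample))  # prints: <p>Сорок</p>
```

The tags themselves are left untouched; only the character references are decoded, so the result can still be passed to an HTML parser afterwards if the markup needs to be stripped.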