Python webpage source read with special characters

Question

I am reading the source of a webpage and then parsing a value out of it. I am running into a problem with special characters.

In my Python controller file I am using # -*- coding: utf-8 -*-. But the webpage source I am reading uses charset=iso-8859-1.

So when I read the page content without specifying any encoding, it throws this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte

When I use string.decode("iso-8859-1").encode("utf-8"), the data is parsed without any error, but the value is displayed as 'F\u00fcnke' instead of 'Fünke'.

Please let me know how I can solve this issue. I would greatly appreciate any suggestions.

try print `u"F\u00fcnke"` – Uku Loskit, Aug 18 '13 at 20:55
I am on Python 2.7 and tried unicode(); it shows the same. – Pradeeshnarayan, Aug 19 '13 at 01:14

score 0 · Answer 1 · edited May 23 '17 at 12:11

Encoding is a PITA in Python3 for sure (and 2 in some cases as well). Try checking these links out, they might help you:

Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)

http://docs.python.org/2/library/codecs.html

Also, it would help to see the code behind "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use utf-8 (Windows, for instance). Your # -*- coding: utf-8 -*- only tells Python which encoding the source code file itself is written in, not the encoding of the data the code is going to parse or analyze. For instance, I write:

# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, I print out the time)
# %H:%M:%S is hour:minute:second (lowercase %m would be the month)
print(time.strftime('%H:%M:%S'))
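The same idea applied to page data, as a minimal sketch rather than the asker's actual code (which wasn't posted). It assumes the page bytes really are ISO-8859-1, as the question states; `raw` is a made-up stand-in for whatever the HTTP read returns. The point is that the bytes get decoded with the page's charset, whatever the coding line of the script says, and that 'F\u00fcnke' is just an escaped way of writing the same string as 'Fünke'.

```python
# -*- coding: utf-8 -*-
# Hypothetical raw bytes, standing in for the content read from an
# ISO-8859-1 page (0xfc is "ü" in that charset).
raw = b"F\xfcnke"

# Decoding as UTF-8 fails: 0xfc is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the page's declared charset gives the intended text.
text = raw.decode("iso-8859-1")

# The \u00fc escape is only a way of writing the character; the
# decoded string is equal to the one spelled with the escape.
same_string = (text == u"F\u00fcnke")

# Re-encode to UTF-8 only at the point where bytes are written out.
utf8_bytes = text.encode("utf-8")
```

If `print(text)` still shows an escape or raises an error, that would point at the console encoding rather than at the decoding step.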

And a downvote for no reason, real constructive (if i'm wrong at least point it out ffs) -.- — Torxed, Aug 18 '13 at 21:45
