
I am reading the source of a web page and parsing a value out of it, and I am running into a problem with special characters.

In my Python controller file I am using # -*- coding: utf-8 -*-, but the page I am reading declares charset=iso-8859-1.

So when I read the page content without specifying any encoding, it throws an error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte

When I use string.decode("iso-8859-1").encode("utf-8"), the data is parsed without any error, but the value is displayed as 'F\u00fcnke' instead of 'Fünke'.

Please let me know how I can solve this issue. I would greatly appreciate any suggestions.

Pradeeshnarayan

1 Answer


Encoding is a PITA in Python3 for sure (and 2 in some cases as well). Try checking these links out, they might help you:

Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)

http://docs.python.org/2/library/codecs.html

Also, it would help to see the code behind "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use UTF-8 (Windows, for instance). Your # -*- coding: utf-8 -*- line only tells Python which encoding the source code file itself is written in; it says nothing about the data the code is going to parse or analyze. For instance, I write:

# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, I print out the time)
# %M is minutes and %S is seconds; %m would be the month number.
print(time.strftime('%H:%M:%S'))
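
To make the difference between source encoding and data encoding concrete, here is a minimal Python 3 sketch. The byte string `raw` is a made-up stand-in for the page content, and the use of `json.dumps` is only an assumption about where the 'F\u00fcnke' form might come from; the question does not show the code that produces the output.

```python
import json

# Hypothetical sample: the bytes a page served as charset=iso-8859-1
# would contain for the text "Fünke" (0xfc is "ü" in ISO-8859-1).
raw = b"F\xfcnke"

# Decoding as UTF-8 fails, because 0xfc is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the charset the page declares gives the intended text.
text = raw.decode("iso-8859-1")

# Assumption: the value is later serialized with json.dumps, which
# escapes non-ASCII characters by default.
escaped = json.dumps(text)
# ensure_ascii=False keeps the character itself in the output string.
readable = json.dumps(text, ensure_ascii=False)

# The escaped form is pure ASCII, so it prints on any console.
print(escaped)
```

If the escaped form does come from a JSON step, the data is already decoded correctly and only the serialization needs `ensure_ascii=False`. Printing `readable` or `text` directly still needs a console that can encode "ü".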
Torxed
  • And a downvote for no reason, real constructive (if i'm wrong at least point it out ffs) -.- – Torxed Aug 18 '13 at 21:45