I am using Python to parse a JSON file. I know that this ¥ character is the reason I got the following error from json.loads:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 106:
invalid start byte

But how do I get around it? Do I decode and encode again?

¥ is the Chinese currency sign (also used for the Japanese yen), but I am not sure which encoding it belongs to.

Thanks!

update:

====================

I think my question should be: if you see this symbol, how do you guess the encoding?

An answer to this question might be:

If you see ¥, then "utf-8" won't work, try "latin-1" instead. Is this understanding correct?

Niebieski
  • JSON operates on utf8 by default. You have to convert the string to utf8 first. If you don't know the encoding, then there is not much you can do. – freakish May 08 '14 at 06:15
  • I tried and json.loads(contents,encoding='latin1') seems to work. But if anyone can give a more comprehensive answer, it would be really appreciated! thanks! – Niebieski May 08 '14 at 06:19
  • But what is it that you don't understand? You have a string in a different encoding, so you specify the encoding when doing `json.loads` and it works. The end of the story. – freakish May 08 '14 at 06:21
  • I figured it out a little after the initial posting. I guess I am curious whether this can be done automatically. I suppose I could write a series of try: except: blocks to try all the big ones. – Niebieski May 08 '14 at 06:31
  • Are you asking whether you can determine the encoding of the string? Generally you can't. Read this: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – freakish May 08 '14 at 06:33
  • @Niebieski You should not *guess* the encoding. The source of the data should tell you the encoding, for example using an HTTP header if you download the file from the web. – Ferdinand Beyer May 08 '14 at 06:34
  • I understand I can get the encoding if it is standard HTML, but this is just a JSON file I am reading, so I guess the encoding info is not provided – Niebieski May 08 '14 at 06:40
  • You *really* need to look into where you're getting that JSON from. The code generating it has a bug (JSON is UTF-8), and fixing it on the consumers' end should only be done if all else fails. – David Ehrmann Jun 02 '14 at 22:48

2 Answers

0

The problem was solved by using the following code:

 json.loads(contents,encoding='latin1') 

I was confused about the encoding because the source did not specify it clearly.
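For readers on Python 3: the `encoding` argument to `json.loads` was deprecated and then removed in Python 3.9, so the call above only works on older interpreters. A rough equivalent is to decode the bytes yourself and pass the resulting string to `json.loads`. The sketch below assumes the data really is Latin-1; the sample bytes and the helper name `load_json_bytes` are made up for illustration.

```python
import json

# Hypothetical sample: JSON written out as Latin-1, so the yen sign is the
# single byte 0xA5 rather than the UTF-8 sequence 0xC2 0xA5.
raw = '{"price": "\u00a5100"}'.encode("latin-1")


def load_json_bytes(data: bytes, encoding: str):
    """Decode bytes with an explicitly chosen encoding, then parse as JSON."""
    return json.loads(data.decode(encoding))


result = load_json_bytes(raw, "latin-1")
print(result)
```

This only gives the right text if Latin-1 is the encoding the producer actually used.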

Niebieski
  • In fact, the source was erroneous, because proper JSON is by definition UTF-8. – tripleee Dec 06 '14 at 16:00
  • For future visitors who are trying to solve this problem, perhaps it needs to be spelled out that you need to know the correct encoding. Putting "Latin-1" will absolutely get rid of the error message, but produce garbage if that is not in fact the correct encoding. – tripleee Dec 14 '19 at 09:30
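The point about Latin-1 silencing the error can be checked directly. A minimal sketch, assuming data that is really UTF-8 gets decoded as Latin-1 by mistake (the sample string is invented for illustration):

```python
# UTF-8 encodes the yen sign (U+00A5) as two bytes: 0xC2 0xA5.
utf8_bytes = "\u00a5100".encode("utf-8")

# Latin-1 maps every byte value 0x00-0xFF to some character, so decoding
# with it never raises, whatever the bytes actually mean.
wrong = utf8_bytes.decode("latin-1")
right = utf8_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))

# All 256 possible byte values decode without complaint under Latin-1.
all_bytes_decoded = bytes(range(256)).decode("latin-1")
print(len(all_bytes_decoded))
```

The same reasoning applies to a chain of try/except attempts: once Latin-1 is in the chain it always "succeeds", so the absence of an exception says nothing about whether the guess was right.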
0

The real answer is that, in the general case, you cannot determine the encoding of an unknown piece of data.

Given context, such as English text, you can sometimes guess e.g. that c?rrupted has had "o" replaced by "?", but if you don't have that sort of context, you can't even tell which bytes are wrong.

For your specific example, you are asking it the wrong way around. If you see a yen sign, which encoding are you using to look at the data? If it's Latin-1, then you are looking at a byte value of 0xA5. This value can be looked up; you could be looking at any of v‎, ¥‎, ¸‎ , Ë‎, Í‎, Ñ‎, Ą‎, ą‎, ċ‎, Ĩ‎, Ľ‎, ź‎, Β‎, Ξ‎, ξ‎, Ѕ‎, Ц‎, е‎, Ґ‎, Ҙ‎, ح‎, ٪‎, ۴‎, ฅ‎, „‎, •‎, ₯‎, ╔‎, ﺄ‎, or a fragment out of a multi-byte encoding.
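To make the lookup concrete, here is a small sketch that decodes the byte 0xA5 under a few of the codecs bundled with CPython. The selection of codecs is arbitrary and nowhere near complete, and the helper name `decode_or_none` is made up for this example.

```python
# The single byte from the error message.
byte = b"\xa5"

# An arbitrary handful of codecs that ship with CPython.
codecs_to_try = [
    "latin-1", "cp437", "cp037", "iso8859-2",
    "iso8859-5", "cp1251", "koi8-r", "utf-8",
]


def decode_or_none(data: bytes, codec: str):
    """Return the decoded text, or None if the bytes are invalid in codec."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


readings = {codec: decode_or_none(byte, codec) for codec in codecs_to_try}

for codec, char in readings.items():
    print(f"{codec:10} -> {char!r}")
```

Each single-byte codec returns a different, perfectly valid character, which is why the byte alone cannot tell you which one the producer meant. On its own, 0xA5 is not valid UTF-8, so that entry comes back as None.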

If the program or organization which produced the unknown data is available, you can talk to people and/or experiment with the software; but if an authoritative answer can't be found, you end up just guessing, or giving up.

There is a reason modern formats require a known encoding, and will reject input which clearly violates that.
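As an illustration of that strictness, the sketch below parses bytes as UTF-8 only and reports the offending byte instead of guessing. The function name `parse_utf8_json` and the sample payloads are assumptions made for the example, not part of any library.

```python
import json


def parse_utf8_json(data: bytes):
    """Parse JSON bytes strictly as UTF-8.

    Returns (value, None) on success or (None, message) if the bytes
    are not valid UTF-8.
    """
    try:
        text = data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        bad_byte = exc.object[exc.start]
        return None, f"byte 0x{bad_byte:02x} at position {exc.start}: {exc.reason}"
    return json.loads(text), None


# Hypothetical payloads: the first contains a lone 0xA5, the second is
# the same document correctly encoded as UTF-8.
bad = b'{"price": "\xa5100"}'
good = '{"price": "\u00a5100"}'.encode("utf-8")

bad_result = parse_utf8_json(bad)
good_result = parse_utf8_json(good)
print(bad_result)
print(good_result)
```

Reporting the position gives you something concrete to take back to whoever produced the data.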

tripleee