UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12

Question

I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has a lot of non-ASCII characters (for example: š, č, ž). I'm using Python 2.7.

When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:

chatBot.set_trainer(ListTrainer)
chatBot.train([
            "Koliko imam še dopusta?",
            "Letos imate še 19 dni dopusta.",
        ])

it throws the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte

I added the # -*- coding: utf-8 -*- line to the top of my file. I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8. Finally, I changed the system default encoding with the following code:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

The strings are of type unicode.

When I try to get a response for input containing these characters, it works without any issues. For example, running the following code in the same execution as the training code above (with 'š' changed to 's' and 'č' to 'c' in the training strings) throws no errors:

chatBot.set_trainer(ListTrainer)
chatBot.train([
            "Koliko imam se dopusta?",
            "Letos imate se 19 dni dopusta.",
        ])    
chatBot.get_response("Koliko imam še dopusta?")

I can't find a solution to this issue. Any suggestions? Thanks loads in advance. :)

EDIT: I used from __future__ import unicode_literals to make the strings of type unicode. I also checked that they really were unicode with type(myString).

I would also like to paste this link.

EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing in the chatbot instance initialization, namely the following:

chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)

This is the cause of my error. The JSON storage file that the chatbot creates is written in my local encoding and not in utf-8. The default storage (.sqlite3) doesn't seem to have this issue, so for now I'll just avoid the JSON storage. But I am still interested in finding a solution to this error.
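For reference, a mismatch like this can be reproduced without ChatterBot. The sketch below is not the library's code and makes no claim about how its JSON adapter opens files; it only shows that a JSON file written in cp1252 cannot be read back as UTF-8, and that passing an explicit encoding on both sides avoids the problem. The file name `database_demo.json` and the variable names are made up for the example. It is written for Python 3 and should also run on 2.7.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# Hypothetical stand-in for a chatbot's JSON database file.
db_path = os.path.join(tempfile.mkdtemp(), "database_demo.json")
statements = {u"Koliko imam še dopusta?": u"Letos imate še 19 dni dopusta."}

# ensure_ascii=False keeps "š" as a real character instead of a \u escape,
# so the encoding used when writing decides which bytes end up on disk.
payload = json.dumps(statements, ensure_ascii=False)
if isinstance(payload, bytes):      # Python 2 can return a byte string
    payload = payload.decode("utf-8")

# Mismatch: written in a local 8-bit encoding, read back as UTF-8.
with io.open(db_path, "w", encoding="cp1252") as handle:
    handle.write(payload)
try:
    with io.open(db_path, "r", encoding="utf-8") as handle:
        handle.read()
    mismatch_error = None
except UnicodeDecodeError as exc:
    mismatch_error = exc

# Consistent: the same explicit encoding for writing and reading.
with io.open(db_path, "w", encoding="utf-8") as handle:
    handle.write(payload)
with io.open(db_path, "r", encoding="utf-8") as handle:
    loaded = json.loads(handle.read())

print("mismatch raised: %r" % (mismatch_error is not None))
print("round trip intact: %r" % (loaded == statements))
```

The point of `io.open(..., encoding=...)` is that nothing is left to the platform's default encoding, which is where a local code page such as cp1252 would otherwise come from.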

You say the strings are of type unicode: are you using `from __future__ import unicode_literals`? Also, which line raises the decode error? Because if the strings are unicode, they shouldn't be decoded (they are all already decoded), so there shouldn't be any decode errors either. — lenz, Nov 03 '17 at 22:26
**Don't** change the default encoding. `setdefaultencoding` is disabled for a reason (libraries expect the default to be `ascii`). — Mark Tolonen, Nov 04 '17 at 01:00
`#coding` declares the encoding of your source file. Make sure you actually save your source file in the declared encoding. — Mark Tolonen, Nov 04 '17 at 01:02
@lenz yes i am using `from __future__ import unicode_literals`. The decode error is raised inside the `train("Koliko imam še dopusta?", "Letos imate še 19 dni dopusta.")` method. — matiOS, Nov 05 '17 at 07:23
@MarkTolonen, ok, noted, I will remove that from my code. I saw it in another Stack Overflow answer to a similar question, where it was marked as correct. I think the file is saved as utf-8; I did the thing in Sublime that the answer below suggests. That's what I meant with "I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8". But how do I know that my file is actually in utf-8 after doing that? When I save, the status bar shows where the file was saved and then, in parentheses, utf-8. — matiOS, Nov 05 '17 at 07:36
The "reload" trick is recommended, usually by newbies, and marked as correct by other newbies. It doesn't make it correct. Here's an article about it: [why-sys-setdefaultencoding-will-break-code](https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/). — Mark Tolonen, Nov 05 '17 at 07:54

score 0 · Answer 1 · answered Nov 04 '17 at 00:47

The strings from your example are not of type unicode.

Otherwise Python would not throw the UnicodeDecodeError.
This type of error means that at some step of the program's execution Python tries to decode a byte string into unicode but fails.

In your case there are two reasons: the decoding is done with utf-8, while your source file is not in utf-8 and is almost certainly in cp1252:

import unicodedata

b = '\x9a'

# u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a 
                        # in position 0: invalid start byte

u = b.decode('cp1252')

print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
print u # š

So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.
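As a cross-check of the offset in the error message, the sketch below encodes the first training string both ways and looks at what sits at position 12. It is an illustration only, with made-up variable names, written for Python 3; it should also run on 2.7.

```python
# -*- coding: utf-8 -*-
text = u"Koliko imam še dopusta?"

as_cp1252 = text.encode("cp1252")   # bytes of the line in a cp1252 file
as_utf8 = text.encode("utf-8")      # bytes of the line in a UTF-8 file

# Slicing (rather than indexing) yields a byte string on Python 2 and 3.
byte_at_12_cp1252 = as_cp1252[12:13]    # "š" is a single byte in cp1252
bytes_at_12_utf8 = as_utf8[12:14]       # "š" takes two bytes in UTF-8

# Decoding the cp1252 bytes as UTF-8 reproduces the failure.
try:
    as_cp1252.decode("utf-8")
    decode_failed_at = None
except UnicodeDecodeError as exc:
    decode_failed_at = exc.start        # offset of the offending byte

print("cp1252: %r, utf-8: %r, decode failed at: %r"
      % (byte_at_12_cp1252, bytes_at_12_utf8, decode_failed_at))
```

"Koliko imam " is twelve characters long, so the first š falls at offset 12, which matches "position 12" in the traceback.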

The best solution is to do nothing other than convert your source to utf-8.
With Sublime Text 3 you can easily do it via File -> Reopen with Encoding -> UTF-8.
But don't forget to copy (Ctrl+C) your source code before the conversion, because right after it all your š, č, ž chars will be replaced with ?.
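If you would rather not rely on the editor, the check and the conversion can also be scripted. This is a rough sketch under the assumption that the file really is cp1252; the helper names `is_valid_utf8` and `convert_to_utf8` and the demo file name are invented for the example. It is written for Python 3 and should also run on 2.7.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path, source_encoding="cp1252"):
    """Rewrite the file as UTF-8, assuming it is in source_encoding now."""
    with io.open(path, "r", encoding=source_encoding) as handle:
        text = handle.read()
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Demo on a throwaway file that imitates a source file saved as cp1252.
demo_path = os.path.join(tempfile.mkdtemp(), "train_demo.py")
with io.open(demo_path, "wb") as handle:
    handle.write(u'chatBot.train([u"Koliko imam še dopusta?"])\n'.encode("cp1252"))

before = is_valid_utf8(demo_path)
convert_to_utf8(demo_path)
after = is_valid_utf8(demo_path)
print("valid UTF-8 before: %s, after: %s" % (before, after))
```

Note that a pure-ASCII file also passes the check, so it only proves the bytes are decodable as UTF-8, not that the editor intended that encoding. Keep a backup before converting real files.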

MaximTitarenko


I already tried this, that's what i meant with: "I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8". But I will play around with this a bit more, i feel there could be something in this direction. It is of type unicode, or at least that is what the `type(myString)` method is telling me. – matiOS Nov 05 '17 at 07:41
@matiOS, there can be unicode strings in your code, like some `myString`, but, for example, the string `"Koliko imam še dopusta?"`, which is inside your `chatBot.train()`, is not unicode in Python 2.7. It's a regular byte string. And it looks like this byte string causes the problem. Also, the `0x9a` byte (which is `š` in `cp1252`) mentioned in the `UnicodeDecodeError` strongly hints that the source is in `cp1252`. How exactly did you change the encoding of your source in Sublime? – MaximTitarenko Nov 05 '17 at 08:12
I saved the mentioned string `"Koliko imam še dopusta?"` inside the `myString` variable, and then ran the `type(myString)` code, the result was **unicode**. I get what you are trying to tell me tho, it seems, the answer is in this direction. I changed the encoding in the exact way you suggested. – matiOS Nov 05 '17 at 08:22
@matiOS, in Python 2.7: `myString = "Koliko imam še dopusta?"; print type(myString)` always gives `<type 'str'>`, which is a byte string, not unicode. Try checking the file's encoding by 1) pressing CTRL+` to open Sublime's console and 2) executing `view.encoding()` in the console – MaximTitarenko Nov 05 '17 at 08:31
You are correct, but in my case I also have the following included: `from __future__ import unicode_literals`, which then results in `<type 'unicode'>`. I know I should have added this in the original question; I did edit it in later, probably after you had seen the post. I tried this, the result: `>>> view.encoding() 'UTF-8'` – matiOS Nov 05 '17 at 08:39
@matiOS, I recreated your example - with the same file content, `# coding: utf-8 ` at the top and `from __future__ import unicode_literals`. If the source is in `cp1252` it throws `SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x9a ...`. If I convert the source to `utf-8` everything works fine. – MaximTitarenko Nov 05 '17 at 09:25
I found the problem!! :) My code was all fine, and the sources were in the right encoding, except for the json database where the chatbot stored its data. I don't know how to change in what encoding a library creates a file, so i guess i just won't use `storage_adapter="chatterbot.storage.JsonFileStorageAdapter"`, when initialising a new chatbot instance and just go with the default .sqlite3 storage. You pointed in the right direction. – matiOS Nov 06 '17 at 13:40

score 0 · Answer 2 · answered Nov 06 '17 at 05:47

Some of our friends have already suggested good partial solutions; however, I would like to combine them all into one.

Also, the author @gunthercox has described some guidelines here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors

# -*- coding: utf-8 -*-
from chatterbot import ChatBot

# Create a new chat bot named Test
chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer'
)

chatBot.train([
    "Koliko imam še dopusta?",
    "Letos imate še 19 dni dopusta.",
])

Python Terminal

>>> # -*- coding: utf-8 -*-
... from chatterbot import ChatBot
>>> 
>>> # Create a new chat bot named Test
... chatBot = ChatBot(
...     'Test',
...     trainer='chatterbot.trainers.ListTrainer'
... )
>>> 
>>> chatBot.train([
...     "Koliko imam še dopusta?",
...     "Letos imate še 19 dni dopusta.",
... ])
List Trainer: [####################] 100%
>>>

You are right, running the above code works for me too. But there is one little difference between this code, and my original code. I used the following line when initialising the chatbot instance `storage_adapter="chatterbot.storage.JsonFileStorageAdapter"`, which writes in to the specified database in my local encoding and not in utf-8. Any suggestions on how to fix this problem? I will edit my original question providing the newly learned information. — matiOS, Nov 06 '17 at 13:49
It is going to be removed from production; see https://github.com/gunthercox/ChatterBot/issues/473#issuecomment-265726313. I would suggest that you use the default storage adapter, ``sqlite3``. — Mallikarjunarao Kosuri, Nov 07 '17 at 06:18
