
I'm developing a chatbot with the chatterbot library. The chatbot is in my native language, Slovene, which has several non-ASCII characters (for example: š, č, ž). I'm using Python 2.7.

When I try to train the bot, the library has trouble with the characters mentioned above. For example, when I run the following code:

chatBot.set_trainer(ListTrainer)
chatBot.train([
            "Koliko imam še dopusta?",
            "Letos imate še 19 dni dopusta.",
        ])

it throws the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 12: invalid start byte

I added the # -*- coding: utf-8 -*- line to the top of my file, changed the encoding of all used files to utf-8 via my editor (Sublime Text 3), and changed the system default encoding with the following code:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

The strings are of type unicode.

When I try to get a response that contains these characters, it works without any issues. For example, running the following code in the same execution as the above training code (after changing 'š' to 's' and 'č' to 'c' in the training strings) throws no errors:

chatBot.set_trainer(ListTrainer)
chatBot.train([
            "Koliko imam se dopusta?",
            "Letos imate se 19 dni dopusta.",
        ])    
chatBot.get_response("Koliko imam še dopusta?")

I can't find a solution to this issue. Any suggestions? Thanks loads in advance. :)

EDIT: I used from __future__ import unicode_literals, to make strings of type unicode. I also checked if they really were unicode with the method type(myString)

I would also like to paste this link.

EDIT 2: @MallikarjunaraoKosuri's code works, but in my case I had one more thing inside the chatbot instance initialization, which is the following:

chatBot = ChatBot(
    'Test',
    trainer='chatterbot.trainers.ListTrainer',
    storage_adapter='chatterbot.storage.JsonFileStorageAdapter'
)

This is the cause of my error. The json storage file that the chatbot creates is written in my local encoding and not in utf-8. It seems the default storage (.sqlite3) doesn't have this issue, so for now I'll just avoid the json storage. But I am still interested in finding a solution to this error.
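
A minimal sketch of the suspected mechanism, without chatterbot itself. It assumes the local encoding is cp1250 (cp1252 maps š to the same byte, 0x9a) and that the json file is written in that encoding and later read back as utf-8. How the storage adapter actually opens its file is not shown in this thread, so the `database.json` name and all variable names here are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

text = u"Koliko imam še dopusta?"

# Assumption: the "local encoding" is cp1250, which stores š as the single byte 0x9a.
local_bytes = text.encode("cp1250")
byte_at_12 = local_bytes[12:13]

# Reading those bytes back as utf-8 fails, because 0x9a cannot start a utf-8 sequence.
try:
    local_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as error:
    error_message = str(error)

# Being explicit about the encoding on both the write and the read round-trips cleanly.
path = os.path.join(tempfile.mkdtemp(), "database.json")  # hypothetical file name
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({u"statement": text}, ensure_ascii=False))
with io.open(path, "r", encoding="utf-8") as handle:
    restored = json.load(handle)[u"statement"]

# With the default ensure_ascii=True, json escapes š as \u0161, so the output
# is pure ASCII and does not depend on the file encoding at all.
escaped = json.dumps(text)
```

If the failing read happens inside a library, the same idea applies only if the library lets you choose the encoding of the file it opens; otherwise switching the storage backend, as described above, avoids the problem.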

matiOS
  • You say the strings are of type unicode: are you using `from __future__ import unicode_literals`? Also, which line raises the decode error? Because if the strings are unicode, they shouldn't be decoded (they are all already decoded), so there shouldn't be any decode errors either. – lenz Nov 03 '17 at 22:26
  • **Don't** change the default encoding. `setdefaultencoding` is disabled for a reason (libraries expect the default to be `ascii`). – Mark Tolonen Nov 04 '17 at 01:00
  • `#coding` declares the encoding of your source file. Make sure you actually save your source file in the declared encoding. – Mark Tolonen Nov 04 '17 at 01:02
  • @lenz yes i am using `from __future__ import unicode_literals`. The decode error is raised inside the `train("Koliko imam še dopusta?", "Letos imate še 19 dni dopusta.")` method. – matiOS Nov 05 '17 at 07:23
  • @MarkTolonen, ok, noted, I will remove that from my code. I saw that in some other stackoverflow answer to a similar question, and it was marked as correct in that thread. I think it is saved as utf-8; I did that thing in Sublime which the answer below is suggesting. That's what I meant with "I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8". But how do I know that after doing that my file is actually in utf-8 encoding? When I save, it writes a status in the program footer showing where the file is saved, and then in parentheses it says utf-8. – matiOS Nov 05 '17 at 07:36
  • The "reload" trick is recommended, usually by newbies, and marked as correct by other newbies. It doesn't make it correct. Here's an article about it: [why-sys-setdefaultencoding-will-break-code](https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/). – Mark Tolonen Nov 05 '17 at 07:54

2 Answers


The strings from your example are not of type unicode.

Otherwise Python would not throw the UnicodeDecodeError.
This type of error means that at some point during the program's execution Python tried to decode a byte string into unicode and failed.

In your case the reason is that:

  • the decoding is done with utf-8
  • your source file is not in utf-8 and almost certainly in cp1252:
    import unicodedata
    
    b = '\x9a'
    
    # u = b.decode('utf-8') # UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a 
                            # in position 0: invalid start byte
    
    u = b.decode('cp1252')
    
    print unicodedata.name(u) # LATIN SMALL LETTER S WITH CARON
    print u # š
    

    So, the 0x9a byte from your cp1252 source can't be decoded with utf-8.


    The best solution is simply to convert your source file to utf-8.
    With Sublime Text 3 you can easily do it via: File -> Reopen with Encoding -> UTF-8.
    But don't forget to Ctrl+C your source code before the conversion, because right after it all your š, č, ž chars will be replaced with ?.
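
To check the bytes on disk rather than trusting the editor's status bar, a small script can help. The following is only a sketch: `is_valid_utf8` and `convert_to_utf8` are hypothetical helper names, and cp1252 is the assumed source encoding (cp1250 would be another candidate for Slovene text, since it also maps 0x9a to š). It runs on Python 2.7 and 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(path, source_encoding="cp1252"):
    """Rewrite the file in place as UTF-8, decoding it from source_encoding."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    with io.open(path, "wb") as handle:
        handle.write(raw.decode(source_encoding).encode("utf-8"))


# Demo on a throwaway file that contains the cp1252 byte for š (0x9a).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.py")
with io.open(demo_path, "wb") as handle:
    handle.write(b'question = "Koliko imam \x9ae dopusta?"\n')

valid_before = is_valid_utf8(demo_path)   # the lone 0x9a byte is not valid UTF-8
convert_to_utf8(demo_path)
valid_after = is_valid_utf8(demo_path)
with io.open(demo_path, "r", encoding="utf-8") as handle:
    converted_text = handle.read()
```

Note that a pure-ASCII file also passes this check, and that converting a file which is already utf-8 would corrupt it, so run the check first and convert only the files that fail it.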

  • MaximTitarenko
    • I already tried this, that's what i meant with: "I also changed the encoding of all used files via my editor (Sublime text 3) to utf-8". But I will play around with this a bit more, i feel there could be something in this direction. It is of type unicode, or at least that is what the `type(myString)` method is telling me. – matiOS Nov 05 '17 at 07:41
    • @matiOS, there can be unicode strings in your code, like some `myString`, but, for example, the string `"Koliko imam še dopusta?"`, which is inside your `chatBot.train()`, is not unicode in Python 2.7. It's a regular byte string. And it looks like this byte string causes the problem. Also, the mention of the `0x9a` byte (which is `š` in `cp1252`) in the `UnicodeDecodeError` strongly hints that the source is in `cp1252`. How exactly did you change the encoding of your source in Sublime? – MaximTitarenko Nov 05 '17 at 08:12
    • I saved the mentioned string `"Koliko imam še dopusta?"` inside the `myString` variable, and then ran the `type(myString)` code, the result was **unicode**. I get what you are trying to tell me tho, it seems, the answer is in this direction. I changed the encoding in the exact way you suggested. – matiOS Nov 05 '17 at 08:22
    • @matiOS, in Python 2.7: `myString = "Koliko imam še dopusta?"; print type(myString)` always gives `<type 'str'>`, which is a byte string, not unicode. Try to check the file's encoding by 1) pressing CTRL+` to open Sublime's console 2) executing `view.encoding()` in the console – MaximTitarenko Nov 05 '17 at 08:31
    • You are correct, but in my case I also have the following included: `from __future__ import unicode_literals`, which then results in `<type 'unicode'>`. I know I should have added this in the original question; I did edit it in later, probably after you had seen the post. I tried this, the result: `>>> view.encoding() 'UTF-8'` – matiOS Nov 05 '17 at 08:39
    • @matiOS, I recreated your example - with the same file content, `# coding: utf-8 ` at the top and `from __future__ import unicode_literals`. If the source is in `cp1252` it throws `SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x9a ...`. If I convert the source to `utf-8` everything works fine. – MaximTitarenko Nov 05 '17 at 09:25
    • I found the problem!! :) My code was all fine, and the sources were in the right encoding, except for the json database where the chatbot stored its data. I don't know how to change in what encoding a library creates a file, so i guess i just won't use `storage_adapter="chatterbot.storage.JsonFileStorageAdapter"`, when initialising a new chatbot instance and just go with the default .sqlite3 storage. You pointed in the right direction. – matiOS Nov 06 '17 at 13:40

    Others have already suggested good partial solutions, but I would like to combine them into one.

    The library's author, @gunthercox, has also described some guidelines here: http://chatterbot.readthedocs.io/en/stable/encoding.html#how-do-i-fix-python-encoding-errors

    # -*- coding: utf-8 -*-
    from chatterbot import ChatBot
    
    # Create a new chat bot named Test
    chatBot = ChatBot(
        'Test',
        trainer='chatterbot.trainers.ListTrainer'
    )
    
    chatBot.train([
        "Koliko imam še dopusta?",
        "Letos imate še 19 dni dopusta.",
    ])
    

    Python Terminal

    >>> # -*- coding: utf-8 -*-
    ... from chatterbot import ChatBot
    >>> 
    >>> # Create a new chat bot named Test
    ... chatBot = ChatBot(
    ...     'Test',
    ...     trainer='chatterbot.trainers.ListTrainer'
    ... )
    >>> 
    >>> chatBot.train([
    ...     "Koliko imam še dopusta?",
    ...     "Letos imate še 19 dni dopusta.",
    ... ])
    List Trainer: [####################] 100%
    >>> 
    
    Mallikarjunarao Kosuri
    • You are right, running the above code works for me too. But there is one little difference between this code, and my original code. I used the following line when initialising the chatbot instance `storage_adapter="chatterbot.storage.JsonFileStorageAdapter"`, which writes in to the specified database in my local encoding and not in utf-8. Any suggestions on how to fix this problem? I will edit my original question providing the newly learned information. – matiOS Nov 06 '17 at 13:49
    • It is going to be removed from production, see https://github.com/gunthercox/ChatterBot/issues/473#issuecomment-265726313. I would suggest you use the default storage adapter, `sqlite3`. – Mallikarjunarao Kosuri Nov 07 '17 at 06:18