
This is a script that sends randomly generated sentences to a Discord chat, but it occasionally fails with this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 2141: ordinal not in range(128)

How would I solve this error?

Code:

import asyncio
import random
import discord.ext.commands
import markovify
import nltk
import re


with open("/root/sample.txt") as f:
    text = f.read()

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [w for w in words if len(w) > 0]
        words = [" :: ".join(tag) for tag in nltk.pos_tag(words)]
        return words

    def word_join(self, words):
        sentence = "".join(word.split("::")[0] for word in words)
        return sentence


text_model = POSifiedText(text, state_size=1)

client = discord.Client()
async def background_loop():
    await client.wait_until_ready()
    while not client.is_closed:
        channel = client.get_channel('286342556600762369')
        messages = [(text_model.make_sentence(tries=33, max_overlap_total=10, default_max_overlap_ratio=0.5))]
        await client.send_message(channel, random.choice(messages))
        await asyncio.sleep(15)

client.loop.create_task(background_loop())
client.run("MjY2NjkwNDY4MjI4NzU5NTU4.C5jcdw.WFfBTUmAY7UcrwKTwYFJ9_bFHjI")

The error is occurring on line 9.

alexis
Museman
  • Your sample contains non-ascii characters. In the `open()` call, specify the correct encoding. – alexis Mar 09 '17 at 06:18
  • Duplicate of [UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>](http://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character) – alexis Mar 09 '17 at 08:41
  • But your question has nothing to do with the nltk, or with the chatbot you're building. Stop tagging everything `nltk`; narrow down your problem and it will be easy to google the solution. – alexis Mar 09 '17 at 08:44
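A minimal sketch of what the first comment suggests, assuming the sample file is UTF-8 encoded. The question does not confirm the encoding; 0xef is a valid lead byte of a three-byte UTF-8 sequence, so UTF-8 is a plausible guess. The helper name `read_text` and the throwaway demo file are hypothetical and only there to make the example self-contained:

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicit encoding."""
    # Without encoding=..., open() falls back to the locale's default,
    # which can be ASCII on a bare server environment.
    with open(path, encoding=encoding) as f:
        return f.read()


# Build a throwaway file that contains non-ASCII characters (assumption:
# it stands in for /root/sample.txt, whose real contents are unknown).
sample = "caf\u00e9 \ufffd ok\n"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

# Decoding as ASCII reproduces the kind of error from the question.
try:
    read_text(demo_path, encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding as UTF-8 reads the same bytes without error.
text = read_text(demo_path)
os.remove(demo_path)
```

In the question's script this would amount to passing `encoding="utf-8"` to the existing `open()` call, if the file really is UTF-8.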

1 Answer


I had a similar problem; it turned out there was an ASCII code that Python wasn't able to turn into a standard symbol. To get around this, you have to tell Python to encode the text as ASCII, ignoring the characters it can't encode, and then decode it back to UTF-8.

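    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [w for w in words if len(w) > 0]
        words = [" :: ".join(tag) for tag in nltk.pos_tag(words)]
        # words is a list, so encode and decode each element individually.
        # Encoding with 'ignore' drops every character that is not ASCII.
        words = [w.encode('ascii', 'ignore') for w in words]
        words = [w.decode('utf-8') for w in words]
        return words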

I added two extra steps before your return statement.
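To see what the encode and decode steps do to a single string, here is a small sketch with a made-up sample string. Note that it works on text that has already been decoded, and that it silently drops characters instead of preserving them:

```python
# Hypothetical sample text containing two non-ASCII characters.
s = "na\u00efve caf\u00e9"

# Encode to ASCII bytes; 'ignore' discards anything outside ASCII.
ascii_bytes = s.encode("ascii", "ignore")

# Pure ASCII bytes are also valid UTF-8, so this decode cannot fail.
roundtrip = ascii_bytes.decode("utf-8")
```

Here `roundtrip` ends up shorter than `s`, because the two accented letters are gone.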

Brady
    That would be a **non-ascii** code. [Learn](http://stackoverflow.com/a/4546129/699305) about encoding/decoding so you'll understand what your code does. (Also, this won't solve the OP's problem). – alexis Mar 09 '17 at 08:45