error with unicode when trying to run python script

Question

I am trying to run a Python script and get an error saying that the 'charmap' codec can't decode a byte because the character maps to <undefined>. I guess it has something to do with Unicode, but I am not experienced enough to solve the problem.

import os
import random

import numpy as np

def load_imdb_sentiment_analysis_dataset(data_path="C:/Users/name/Desktop", seed=123):
    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data.
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                # No encoding is given here, so the platform default is used.
                with open(os.path.join(train_path, fname)) as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the validation data.
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname)) as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels with the same seed so they stay aligned.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))

I get the following error: UnicodeDecodeError: 'charmap' codec can't decode byte 0xaa in position 489: character maps to <undefined>

Possible duplicate of [UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character) - Nishant, Jan 22 '19 at 13:04

Litvin · Accepted Answer · 2019-01-22T13:20:27.197

You need to figure out the encoding of the file you are trying to open, and pass it to the open function.
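If you don't know the encoding up front, one rough way to narrow it down is to read the raw bytes and try a few candidates in strict mode. This is only a sketch: guess_encoding and its candidate list are hypothetical names chosen for illustration, and a successful decode does not prove the guess is right.

```python
import locale


def guess_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file, or None.

    The candidate list is an assumption; adjust it to the encodings you
    actually expect. Avoid putting 'latin-1' first, since it never fails.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)  # strict decoding raises on any invalid byte
        except UnicodeDecodeError:
            continue
        return enc
    return None


# The encoding open() falls back to when no encoding argument is given.
print(locale.getpreferredencoding(False))
```

On Windows the printed default is usually a legacy code page rather than UTF-8, which is why the same script can work on one machine and fail on another.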

For example, for UTF-8: open(filename, encoding='utf8')

So in your code you can change with open(os.path.join(train_path, fname)) to with open(os.path.join(train_path, fname), encoding='utf8'), and do the same for the open call that uses test_path.
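Since the question's code has the same loop twice, it may be easier to move the reading into a small helper so the encoding is set in one place. A minimal sketch, assuming the files really are UTF-8; read_text_files is a hypothetical helper name, not something from the original script:

```python
import os


def read_text_files(directory, encoding="utf-8"):
    """Read every .txt file in directory (sorted by name) with an explicit encoding."""
    texts = []
    for fname in sorted(os.listdir(directory)):
        if fname.endswith(".txt"):
            path = os.path.join(directory, fname)
            # Passing encoding explicitly avoids depending on the platform default.
            with open(path, encoding=encoding) as f:
                texts.append(f.read())
    return texts
```

Inside the loader you could then call something like read_text_files(train_path) for each category and extend train_texts with the result.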

If you don't care about the characters that can't be decoded, you can just skip them (be careful with this approach, because the undecodable bytes are dropped silently): open(filename, errors='ignore')

with open(os.path.join(train_path, fname), errors='ignore') as f:
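To see what skipping actually does, here is a small illustration on a byte string rather than a file. It deliberately decodes UTF-8 bytes as ASCII to force failures; the sample text is made up, and the errors argument behaves the same way when passed to open:

```python
# UTF-8 bytes for a short made-up string with two non-ASCII characters.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# errors='ignore' drops every byte the codec cannot decode.
ignored = raw.decode("ascii", errors="ignore")

# errors='replace' substitutes U+FFFD for each undecodable byte instead,
# so the damage stays visible in the text.
replaced = raw.decode("ascii", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

If you need to notice later that something was lost, errors='replace' is often the safer of the two, since 'ignore' leaves no trace.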
