'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

Question

I'm trying to make a product review analyzer with Python. I built a dataset in Excel with two columns containing positive and negative feedback adjectives. The program should analyze a review and count the positive and negative feedback words in its text with a for loop.

import numpy as np
import pandas as pd

data = pd.read_csv("data.csv")

str = "some string"

numbers = []
positives = []
negatives = []

def wordCount(word):
    avoided = word.split()
    print("There are", len(avoided), "words in this string")
    for i in range(len(avoided)):
        numbers.append(avoided.count(avoided[i]))
        if avoided[i] in data["Positive"]:
            positives.append(avoided[i])
        elif avoided[i] in data["Negative"]:
            negatives.append(avoided[i])
    print(positives, negatives)
    print(numbers)
    print(avoided[numbers.index(np.max(numbers))], np.max(numbers))

wordCount(str)

But unfortunately, when I try to get each column of the dataset, an error occurs:

'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

I tried encoding and decoding the dataset and tried converting it into a list. Neither worked, and the program kept giving me the same error.

Is it because I import the dataset the wrong way? Is something wrong with my code?

Can someone please help me solve it?

Have you tried to specify the encoding of your `.csv`? You can do it like this: `data = pd.read_csv("data.csv", encoding='ansi')`. You can check the right encoding by opening your `.csv` file in `Notepad++`. Read more about encodings here: https://docs.python.org/3/library/codecs.html#standard-encodings — Timeless, Sep 21 '22 at 19:29
That fixed the encoding of the CSV file. Thank you very much for your help :) But now it says "KeyError: 'Positive'". I'm currently researching that error and I hope I can fix that soon too. — memos815, Sep 22 '22 at 13:12
Make sure that `Positive` is a column in your dataframe `data`. Run this: `data.columns`. — Timeless, Sep 22 '22 at 13:20
I fixed that too but now I get the "ValueError: zero-size array to reduction operation maximum which has no identity" error. — memos815, Sep 22 '22 at 15:05
Can you share a sample of your dataset and the expected output? — Timeless, Sep 22 '22 at 15:06
Sure. positives, negatives Adaptable, Abrasive Adventurous, Apathetic Amazing, Controlling Amiable, Dishonest Beautiful, Impatient Becoming, Anxious Beloved, Betrayed Blessed, Disappointed Blissful, Embarrassed Brotherly, Jealous Calming, Abysmal... It continues like this. The output I want to have is the number of positive and negative words in the string I passed as "word". — memos815, Sep 23 '22 at 16:28
Please edit your question so it contains a [reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). In particular, please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples/20159305#20159305). — Jasmijn, Sep 26 '22 at 06:17
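The output requested in the comments above (the number of positive and negative words in a review) can be sketched without pandas. This is a minimal sketch under stated assumptions, not the asker's final code: the word sets are a small excerpt of the sample shared above, and `count_sentiment` is a hypothetical helper name. It also sidesteps two pitfalls visible in the thread: `word in data["Positive"]` on a pandas Series tests the index labels rather than the values, and taking a maximum of an empty list raises the "zero-size array" `ValueError`.

```python
from collections import Counter

# Assumption: a small excerpt of the sample data from the comments, lowercased.
positive_words = {"adaptable", "adventurous", "amazing", "amiable", "beautiful"}
negative_words = {"abrasive", "apathetic", "controlling", "dishonest", "impatient"}


def count_sentiment(review, positives, negatives):
    """Count positive and negative words in a review (hypothetical helper)."""
    # Normalise: split on whitespace, strip common punctuation, lowercase.
    words = [w.strip(".,!?;:").lower() for w in review.split()]
    words = [w for w in words if w]

    found_positive = [w for w in words if w in positives]
    found_negative = [w for w in words if w in negatives]

    # most_common(1) returns an empty list for an empty review,
    # so there is no max() over an empty sequence.
    most_common = Counter(words).most_common(1)

    return {
        "total": len(words),
        "positive": len(found_positive),
        "negative": len(found_negative),
        "most_common": most_common[0] if most_common else None,
    }


result = count_sentiment(
    "Amazing and beautiful, but the seller was dishonest.",
    positive_words,
    negative_words,
)
print(result)
```

With a DataFrame, the sets could be built from the column values, for example `set(data["positives"].dropna().str.lower())`, assuming the column is really named `positives` as in the shared sample; `data.columns` shows the actual names.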

score 0 · Answer 1 · answered Sep 26 '22 at 06:04

Welcome to Stack Overflow!

The character in question is ~ (tilde), which means that the issue is in your file, not in encoding or decoding, since the code for tilde is pretty much the same in Unicode and ASCII.

However this is a bit complicated, since reading/writing a file is part of serialization. This means that there are file handlers that pass the whole file into memory and then read it as a list.

This is done by file separators (FS), often part of the header/footer binary representation of a file, that are platform-specific and tell the program where the file starts and how long (in bytes) it is. Each file then consists of blocks that are read, and each file type has its own block size.

However, the block size is determined by the file encoding, since different kinds of encodings tend to have different byte sizes (UTF-8 code units are 1 byte, UTF-16 code units are 2 bytes, etc.).
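As a quick check of the byte sizes mentioned above, the sketch below encodes a few characters and records how many bytes each takes. It only uses `str.encode`; the sample characters are arbitrary choices for illustration. Note that both encodings are variable-width: UTF-8 uses 1 to 4 bytes per character and UTF-16 uses 2 or 4.

```python
# Arbitrary sample characters, written as escapes to stay encoding-neutral.
samples = {
    "a": "a",               # ASCII letter
    "e-acute": "\u00e9",    # Latin-1 range
    "euro": "\u20ac",       # Basic Multilingual Plane
    "emoji": "\U0001F600",  # outside the BMP (surrogate pair in UTF-16)
}

# Map each name to (bytes in UTF-8, bytes in UTF-16 without a BOM).
sizes = {
    name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for name, ch in samples.items()
}

for name, (utf8_len, utf16_len) in sizes.items():
    print(f"{name}: utf-8={utf8_len} bytes, utf-16={utf16_len} bytes")
```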

What you most likely got was a UnicodeDecodeError, which in this case, regarding blocks of data, is equivalent to an IndexError within a block, because the decoder found a character at index 0 (tilde) that has no meaning within the encoding context.

Now, the issues that result in this error can be vast: a wrong encoding, a corrupted file, a difference between a *.csv file and how Excel writes *.csv files... or (most probably in this case) overwriting the str() function on line 6 of your code (`str = "some string"`).
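One way to narrow down the "wrong encoding" possibility listed above is to look at the first bytes of the file. The sketch below checks for a byte-order mark (BOM) using constants from the standard `codecs` module. It is a sketch under assumptions: `sniff_encoding` is a hypothetical helper, the sample bytes are made up, and a leading `0xFE 0xFF` (a UTF-16 big-endian BOM) is only one explanation consistent with "byte 0xfe in position 0".

```python
import codecs

# Known BOMs, UTF-32 first because its little-endian BOM starts with
# the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, else `default`."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


# Made-up sample: CSV text saved as UTF-16 big-endian with a BOM.
sample = codecs.BOM_UTF16_BE + "positives,negatives\nAmazing,Abysmal\n".encode("utf-16-be")

# Decoding these bytes as UTF-8 fails on the very first byte.
try:
    sample.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

encoding = sniff_encoding(sample)
text = sample.decode(encoding)  # the "utf-16" codec consumes the BOM
print(utf8_error)
print(encoding, repr(text))
```

For a real file, the first few bytes can be read with `open("data.csv", "rb").read(4)` and the resulting name passed as `pd.read_csv("data.csv", encoding=...)`. A file without a BOM needs a different approach, such as checking the encoding in an editor as suggested in the comments.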

There is no fault in using external libraries when doing quick work; however, coders should know how to write their own custom file reader for this reason alone. It helps to troubleshoot issues within other libraries and to know what to touch and what can be changed.

It is quite possible that numpy or pandas use the str() function within their code (as it is the bread and butter of Python) and that you overwriting it has generated Undefined Behavior. It is possible that when either of those libraries calls str(), it calls some method from your file which reads a specific part of the file as a binary block, which would explain the whole IndexError/UnicodeDecodeError issue.

This, however, doesn't have to be the solution to your issue, but that is as far as I can go without looking at the file, looking at the external libraries' code, and retesting serialization or how the libraries handle files on your specific platform. Just to preface here, I can't do that due to this being a virtual setting.

For more information about *.csv file readers and writers you can check this link.

Cheers <3

`ord('~') == 0x7E`, not `0xFE`. Also, while overriding builtins is a bad idea, globals in Python are module local, so doing `str = ...` in your code will not affect numpy or pandas. — Jasmijn, Sep 26 '22 at 06:12
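Both points in the comment above can be checked directly. The sketch below is illustrative only: the two modules are created in memory with `types.ModuleType`, and their names (`user_script`, `some_library`) and the function `to_text` are hypothetical stand-ins for the asker's script and for a library such as pandas.

```python
import types

# Point 1: tilde is 0x7E, a plain ASCII byte that UTF-8 decodes fine.
tilde_code = ord("~")
tilde_roundtrip = b"\x7e".decode("utf-8")

# Point 2: rebinding `str` in one module leaves other modules untouched.
user = types.ModuleType("user_script")  # hypothetical user module
exec('str = "some string"', user.__dict__)

lib = types.ModuleType("some_library")  # hypothetical library module
exec("def to_text(value):\n    return str(value)\n", lib.__dict__)

# Inside `user`, the name `str` is now a plain string ...
shadowed = user.str
# ... but `lib` still resolves `str` to the builtin type.
converted = lib.to_text(42)

print(hex(tilde_code), repr(tilde_roundtrip))
print(repr(shadowed), repr(converted))
```

Shadowing a builtin is still worth avoiding in your own module, since later calls such as `str(5)` in that same file would fail.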
