
I am trying to read a CSV file from Google Drive with the pandas library. However, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 11: invalid start byte

I downloaded the data successfully and stored it in "/content/data", which is my working directory.

To read the data I do the following:

import os
import pandas as pd

file1 = os.path.join(os.getcwd(), 'file.txt')
# /content/data/file.txt

df = pd.read_csv(file1, delimiter='\t')

And that's where I get the error. What is the problem here?

I already tried the proposed solutions here: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c. However, I still get the same error.

Mark Rotteveel
bananabread
  • It seems that the file is not encoded as UTF-8. You have to find out the right encoding, e.g. with an advanced text editor that can show and change the encoding. – Michael Butscher Mar 26 '23 at 14:02
  • You should be able to reproduce this without pandas: `open(file1).read()`. If not, then `sys.stdin.encoding` would be a good guess at its encoding. – tdelaney Mar 26 '23 at 14:08
  • @MichaelButscher It is UTF-8 encoding. I checked. – bananabread Mar 26 '23 at 14:59
  • @tdelaney I get the same error message with open(file1).read(). – bananabread Mar 26 '23 at 15:00
  • You submitted your question for review after the last edit, but all you added is *"I tried the solutions in that question and it didn't work"*. You didn't show ***what*** you tried and you ***never*** posted a [mre] of the file giving you this error. How can anyone help you except by pointing to that duplicate question? – Tomerikoo Mar 28 '23 at 14:28
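
Following up on the comments above, here is a minimal sketch for checking where UTF-8 decoding fails and which other encodings at least decode without raising. The helper names `find_bad_byte` and `decodable_with`, the candidate list, and the sample bytes are all made up for illustration; they are not taken from the asker's file.

```python
from typing import Optional, Tuple


def find_bad_byte(raw: bytes, context: int = 10) -> Optional[Tuple[int, bytes]]:
    """Return (position, surrounding bytes) of the first byte that is not valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        start = max(err.start - context, 0)
        return err.start, raw[start:err.start + context]
    return None  # the data is valid UTF-8


def decodable_with(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")) -> list:
    """Return the candidate encodings that decode the data without raising."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Hypothetical sample standing in for the start of a tab-separated file
# that contains the byte 0xae from the error message.
sample = b"id\tname\n1\tAcme\xae\n"

print(find_bad_byte(sample))
print(decodable_with(sample))
```

For the real file, the bytes could be read with `open(file1, 'rb').read()` and passed to both helpers. Keep in mind that `latin-1` accepts every byte value, so decoding without an error only shows that an encoding is possible, not that it is the right one; looking at the decoded text around the reported position is still needed.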

1 Answer


Use encoding='unicode_escape':

file1 = os.path.join(os.getcwd(), 'file.txt')
# /content/data/file.txt

df = pd.read_csv(file1, encoding='unicode_escape')
Dump Eldor
  • Why unicode_escape specifically? Given that the bad character isn't in the ASCII range, we are pretty much guaranteed it's not unicode_escape. – tdelaney Mar 26 '23 at 14:07
  • I tried this already and I get the following error: UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 6432-6433: malformed \N character escape – bananabread Mar 26 '23 at 15:01
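
A short sketch of why the `unicode_escape` attempt can fail with the "malformed \N character escape" message quoted above: that codec interprets backslash sequences in the data, so a literal backslash followed by `N` is treated as the start of a `\N{...}` escape. The sample bytes below are hypothetical (a Windows-style path plus the 0xae byte), and `cp1252` is only an assumed candidate, since the actual encoding of the file is not known from the thread.

```python
# Hypothetical sample: a literal backslash followed by "N" (as in a Windows path)
# and the byte 0xae from the original error message.
sample = b"C:\\New\tAcme\xae\n"

# unicode_escape tries to interpret "\N" as the start of a \N{...} escape and fails.
try:
    sample.decode("unicode_escape")
    escape_error = None
except UnicodeDecodeError as err:
    escape_error = err.reason

print(escape_error)

# A single-byte encoding such as cp1252 (an assumption, not confirmed for this file)
# leaves the backslash alone and maps 0xae to the registered-sign character.
text = sample.decode("cp1252")
print(repr(text))
```

If a single-byte encoding does turn out to be the right one, it can be passed to pandas through the `encoding` parameter of `pd.read_csv`, for example `pd.read_csv(file1, delimiter='\t', encoding='cp1252')`.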