How do I find where in a CSV file an error occurs when importing with Pandas?

Question

Let's say I try to import a CSV file using pd.read_csv() and get this error:

'utf-8' codec can't decode byte 0x93 in position 214567: invalid start byte

How do I interpret the error message and find which character in the CSV file is causing the issue? Is it the 214567th character, and if so, how do I find it in Notepad, Excel, or something similar?

The underlying problem is that Python is trying to read the file using UTF-8 encoding (the default), but the file actually uses some other encoding, which you will need to specify. Based on how 0x93 is used in common encodings (https://tripleee.github.io/8bit/#93), I would bet you have a left curly double quote, but it could be one of many encodings. This question (https://stackoverflow.com/questions/269060/is-there-a-python-library-function-which-attempts-to-guess-the-character-encodin) suggests using the `chardet` module to help you find the right encoding if you don't know it. — slothrop, May 02 '23 at 17:49
I understand the problem, but that doesn't answer my question: how do I manually find the offending characters in the file? I asked because the previously asked questions don't address this specific point. — we_are_all_in_this_together, May 02 '23 at 18:11
@stefan_aus_hannover Hmm, it says it's not finding anything... I tried searching for both \x93 and \x{93} in regular expression mode... — we_are_all_in_this_together, May 02 '23 at 18:15

score 0 · Answer 1 · answered May 02 '23 at 18:28

Use this website: http://www.alanwood.net/demos/ansi.html

Find the character whose ANSI hex code is 0x93, open the CSV in Excel, and search for that character. I'm still not sure how to locate character number 214567, though.
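
For the remaining question, here is a minimal sketch. It assumes the position in the error message is a 0-based byte offset from the start of the file, which may not hold for every parser. It counts the newlines before that offset to get a line number you can jump to in an editor. The helper name `offset_to_line_col` is hypothetical.

```python
def offset_to_line_col(path, offset):
    """Translate a 0-based byte offset into a 1-based (line, column) pair.

    The column is counted in bytes, not characters, because the file
    cannot be decoded reliably until the encoding is known.
    """
    with open(path, "rb") as f:
        head = f.read(offset)  # every byte before the offending one
    if len(head) < offset:
        raise ValueError("offset is past the end of the file")
    line = head.count(b"\n") + 1
    # rfind returns -1 when there is no newline, so the line then starts at byte 0
    line_start = head.rfind(b"\n") + 1
    column = offset - line_start + 1
    return line, column
```

Calling `offset_to_line_col("input.csv", 214567)` would give the line to jump to in a text editor. If quoted fields contain line breaks, the file line number will not match the CSV row number.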

we_are_all_in_this_together

score 0 · Answer 2 · answered May 02 '23 at 18:53

I found 0x93 (“) by searching for this in Notepad++:

\x93
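
If the editor search finds nothing, the same search can be done on the raw bytes in Python. This is only a sketch and the function name `find_byte_offsets` is made up for illustration. It lists every offset where the byte occurs, so you can check whether 214567 is among them.

```python
def find_byte_offsets(path, byte_value=0x93):
    """Return the 0-based offsets of every occurrence of one byte value."""
    with open(path, "rb") as f:
        data = f.read()
    needle = bytes([byte_value])
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)  # continue after the previous hit
    return offsets
```

Not every hit is necessarily an error. In valid UTF-8 text, 0x93 can appear legitimately as the continuation byte of a multi-byte character.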

stefan_aus_hannover

Mark Tolonen · Answer 3 · 2023-05-03T06:39:44.803

Using Python only, you can read the file in binary mode up to (or a little past) offset 214,567 and dump the end of the data.

>>> data = open('input.csv','rb').read(214570)   # Just a large file I had
>>> data[-5:]   # dump the last 5 bytes
b'this\x00'
>>> data[-5:].hex(' ')  # as a hexadecimal dump (the 69 is at offset 214,567)
'74 68 69 73 00'
>>> hex(data[214567])
'0x69'
>>> hex(214567)  # hexadecimal offset
'0x34627'

Or use a hexdump utility. A good one will have a "goto offset" capability.

[Image: hexdump view showing byte 69 at offset 0x34627]

Your file should have a 93 at that location.
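
If you are unsure whether the reported position is relative to the whole file, you can decode the raw bytes yourself and let Python report the offset. `UnicodeDecodeError` carries the failing position in its `start` attribute. The following is a sketch under that approach; the function name `show_decode_error` and the `context` parameter are my own, not part of any library.

```python
def show_decode_error(path, encoding="utf-8", context=20):
    """Decode the whole file; on failure return (offset, bad_byte, nearby_bytes).

    Returns None when the file decodes cleanly with the given encoding.
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        start = err.start  # 0-based offset of the first undecodable byte
        nearby = data[max(0, start - context):start + context]
        return start, data[start], nearby
    return None
```

To read the surrounding text, the returned bytes can be decoded with a guessed encoding, for example `nearby.decode("cp1252", errors="replace")`. If that guess looks right, passing `encoding="cp1252"` to `pd.read_csv()` should get past the error.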
