Why does my first element in readlines() of a CSV have additional characters?

Question

I ran the following Python code to open a CSV, and the first element had some extra characters in it that aren't present when I view the CSV in a text editor such as Notepad++.

priorities_file = open('priorities.txt', 'r')
print('Name of the file: ', priorities_file.name)

p = priorities_file.readlines()
print('Read Line: %s' % (p))

The output looked like this:

Name of the file:  priorities.txt    
Read Line: ['ï»¿Autonomy\n', 'Travel\n',...

I understand the '\n' and how to remove that from each element, but I don't understand why there are additional characters in front of the element ' Autonomy'. Can anyone tell me why this is? Bonus points for a way to remove those characters, which I honestly couldn't figure out how to reproduce.

https://stackoverflow.com/questions/20848761/extra-characters-in-readlines-and-join-python-how-to-remove-%C3%AF-byte-order-m looks like a good discussion. — derelict, Apr 10 '18 at 22:51
If this is only on the first line, it's an exact dup of that question (although I'm pretty sure we have a better one, with an actual answer). If it's on each line, the right answer is a bit more complicated. Either way, the _ideal_ solution is to change the way the CSV file is created to not use spurious BOMs in the first place. Are you creating the file, or is it something given to you that you have no control over? — abarnert, Apr 10 '18 at 22:57
https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/13591421 — anon01, Apr 10 '18 at 23:00
On the duplicate question, you probably want to use [lightswitch05's answer](https://stackoverflow.com/a/44573867/908494), not the more complicated accepted one. — abarnert, Apr 10 '18 at 23:05
This particular file was created using Excel, so perhaps that has something to do with it. I could honestly stand to learn much from how encoding works. I couldn't tell you the difference between UTF-8 or UTF-16. — Magical Orange, Apr 11 '18 at 00:37

eatmeimadanish · Answer 1 · 2018-04-10T23:11:51.447

-1

repr() would help (on Python 3.x, use ascii() instead).

p = priorities_file.readlines()
print(repr(p))

My hunch is that the encoding in the CSV file is not actually ASCII or UTF-8?
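To illustrate where the extra characters can come from, here is a minimal sketch. It assumes the file starts with a UTF-8 byte order mark (as the comments on the question suggest) and that it was read with a Windows default encoding such as cp1252. The sample bytes are made up for the demonstration and are not taken from the actual file.

```python
import codecs

# Hypothetical first line of a UTF-8 file that begins with a BOM (EF BB BF).
raw = codecs.BOM_UTF8 + "Autonomy\n".encode("utf-8")

# Decoded as cp1252, the three BOM bytes become three visible characters.
as_cp1252 = raw.decode("cp1252")
print(ascii(as_cp1252))    # '\xef\xbb\xbfAutonomy\n'

# Decoded as plain UTF-8, the BOM survives as a single U+FEFF character.
as_utf8 = raw.decode("utf-8")
print(ascii(as_utf8))      # '\ufeffAutonomy\n'

# The utf-8-sig codec drops a leading BOM while decoding.
as_utf8_sig = raw.decode("utf-8-sig")
print(ascii(as_utf8_sig))  # 'Autonomy\n'
```

Using ascii() makes the otherwise invisible or mis-rendered characters show up as escape sequences, which is why it is handy for this kind of debugging.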

UPDATE:

This should do the trick:

p = p.decode("utf-8-sig")
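Note that in the question p is the list returned by readlines(), and a list has no decode() method, so the codec probably needs to be applied when the file is opened instead. A minimal sketch, assuming Python 3 and a UTF-8 file with a leading BOM; the helper name read_lines_without_bom and the temporary file standing in for priorities.txt are hypothetical:

```python
import codecs
import os
import tempfile


def read_lines_without_bom(path):
    """Read all lines, letting the utf-8-sig codec drop a leading BOM if present."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.readlines()


# Demo: write a small file that starts with a UTF-8 BOM, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "priorities.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"Autonomy\nTravel\n")
    lines = read_lines_without_bom(path)

print(lines)  # ['Autonomy\n', 'Travel\n']
```

The utf-8-sig codec also reads files that have no BOM, so it should be safe to use whether or not the file came from Excel.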

edited Apr 10 '18 at 23:11

answered Apr 10 '18 at 22:57


First, this is a comment, not an answer. Second, that's a UTF-8 BOM, so the rest of the file almost certainly is UTF-8. – abarnert Apr 10 '18 at 22:58
Thanks for policing the responses. Sorry for trying to offer assistance, I will cease and desist immediately so as not to cause any more harm. – eatmeimadanish Apr 10 '18 at 23:06
I created the file using Excel, so I'm guessing Excel encodes it in a particular way that I'm blatantly ignorant of. :) – Magical Orange Apr 11 '18 at 00:38
p = p.decode("utf-8-sig") will solve his problem... so it is in fact an answer... And my suggestion that it was not decoded successfully was factual. It was just stated in a question form as a conversational element. In any case, my solution solves the question in the post. – eatmeimadanish Apr 11 '18 at 16:33
