Why does my Python code print the extra characters "ï»¿" when reading from a text file?

Question

try:
    data=open('info.txt')
    for each_line in data:
        try:
            (role,line_spoken)=each_line.split(':',1)
            print(role,end='')
            print(' said: ',end='')
            print(line_spoken,end='')
        except ValueError:
            print(each_line)
    data.close()
except IOError:
    print("File is missing")

When I print the file line by line, the output starts with three extra characters, namely "ï»¿".

Actual output:

ï»¿Man said:  Is this the right room for an argument?
Other Man said:  I've told you once.
Man said:  No you haven't!
Other Man said:  Yes I have.

Expected output:

Man said:  Is this the right room for an argument?
Other Man said:  I've told you once.
Man said:  No you haven't!
Other Man said:  Yes I have.

Your file is probably encoded in UTF-8 __with__ BOM. If this isn't what you want, encode it _without_ BOM. — Vincent Savard, Dec 21 '15 at 15:32
Possible duplicate of [How do I remove ï»¿ from the beginning of a file?](http://stackoverflow.com/questions/3255993/how-do-i-remove-%c3%af-from-the-beginning-of-a-file) — Marc B, Dec 21 '15 at 15:32
@MarcB Not a dupe of that; Python is not PHP, and has better options for handling the UTF-8 BOM. OP, pass `encoding='utf-8-sig'` to your `open()` call. — senshin, Dec 21 '15 at 15:33
Yes, Vincent is right. That's typical for the [Byte-order mark](https://en.wikipedia.org/wiki/Byte_order_mark). — Boldewyn, Dec 21 '15 at 15:33
@senshin It worked, thanks. `data=open('sketch.txt',encoding='utf-8-sig')` — , Dec 21 '15 at 15:39
@vrkratheesh Not that I know of. You could obviously create a wrapper for `open`, e.g. `my_open = functools.partial(open, encoding='utf-8-sig')` and use that instead. (Ideally, you would just not encode your UTF-8 files with BOM, since UTF-8 is endianness-independent and doesn't need a BOM. Though if you're getting your files from some external source, I suppose that's not any easier.) — senshin, Dec 23 '15 at 13:38
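
A minimal sketch of the `functools.partial` wrapper suggested in the comment above. The name `my_open` comes from that comment; the temporary file and its contents are made up for illustration.

```python
import functools
import os
import tempfile

# Wrapper from the comment: behaves like open(), but defaults to utf-8-sig.
my_open = functools.partial(open, encoding="utf-8-sig")

# Hypothetical test file that starts with the UTF-8 BOM bytes EF BB BF.
path = os.path.join(tempfile.mkdtemp(), "sketch.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfMan: No you haven't!\n")

# Reading through the wrapper drops the BOM from the first line.
with my_open(path) as f:
    lines = f.readlines()

print(lines)
```

One caveat with this approach: the wrapper always passes `encoding=`, so it cannot be used for binary modes such as `'rb'`, and a file written through it in text mode will itself start with a BOM.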

score 106 · Accepted Answer · edited Jul 03 '23 at 16:23

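Instead of opening the file with the default encoding, use 'utf-8-sig', which expects and strips off the UTF-8 Byte Order Mark (BOM). The default is platform-dependent: unless you pass encoding=, Python 3 uses the locale's preferred encoding. The ï»¿ you see is what the three BOM bytes (EF BB BF) look like when they are decoded with a Windows code page such as cp1252 instead of UTF-8.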

That is, instead of

data = open('info.txt')

Do

data = open('info.txt', encoding='utf-8-sig')
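
To see where the three characters come from, here is a small sketch that decodes the same bytes three ways. The temporary file and the helper `first_line` are made up for illustration; `cp1252` is an assumption about the asker's platform default, chosen because it reproduces the reported output.

```python
import os
import tempfile

# Hypothetical sample file: UTF-8 text preceded by the BOM bytes EF BB BF.
path = os.path.join(tempfile.mkdtemp(), "info.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfMan: Is this the right room for an argument?\n")


def first_line(encoding):
    """Return the first line of the sample file decoded with `encoding`."""
    with open(path, encoding=encoding) as f:
        return f.readline()


as_cp1252 = first_line("cp1252")        # each BOM byte becomes one character
as_utf8 = first_line("utf-8")           # BOM survives as a single U+FEFF
as_utf8_sig = first_line("utf-8-sig")   # BOM is recognized and dropped

print(repr(as_cp1252[:3]))
print(repr(as_utf8[0]))
print(repr(as_utf8_sig[:3]))
```

Plain 'utf-8' does not remove the BOM either; it just turns it into one invisible character (U+FEFF) instead of three visible ones, which is why 'utf-8-sig' is the safer choice for files that may carry a BOM.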

Note that if you're on Python 2, you should see e.g. "Python, Encoding output to UTF-8" and "Convert UTF-8 with BOM to UTF-8 with no BOM in Python". You'll need to do some shenanigans with codecs or with str.decode for this to work right in Python 2. But in Python 3, all you need to do is set the encoding= parameter when you open the file.
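
For code that has to run on Python 2 as well, one option the answer does not mention is `io.open`, which accepts an `encoding=` argument on Python 2.6+ and is the same function as the built-in `open` on Python 3. This is only a sketch of that alternative, not the approach from the linked questions; the temporary file is made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample file that starts with the UTF-8 BOM bytes.
path = os.path.join(tempfile.mkdtemp(), "info.txt")
with io.open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfMan: Is this the right room for an argument?\n")

# io.open takes encoding= on both Python 2.6+ and Python 3,
# so the same line works without codecs or str.decode.
with io.open(path, encoding="utf-8-sig") as f:
    text = f.read()

print(repr(text))
```

On Python 2 the value read this way is a `unicode` object rather than a byte `str`, which may matter if the rest of the program mixes the two.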

answered Dec 21 '15 at 15:39 by senshin · edited Jul 03 '23 at 16:23 by wjandrea

Even after using this encoding, I am still getting \ufeff in front of some lines. Any idea why? – Amrit Dec 02 '20 at 18:49
@Amrit The BOM should only occur at the start of a text stream. So, if you're seeing it in the middle, then it's probably meant to be a [zero-width no-break space](https://en.wikipedia.org/wiki/Word_joiner) in legacy Unicode text. I believe it was used in Indian scripts like Devanagari. Or you have a file that's made up of a bunch of "utf-8-sig" files concatenated together. – wjandrea Jul 03 '23 at 16:35
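
The concatenation case from the comment above can be reproduced and worked around roughly as follows. This is a sketch under the assumption that any U+FEFF at the start of a line is a leftover BOM and not intentional text; the sample strings are made up.

```python
# Simulate two BOM-prefixed UTF-8 files concatenated into one byte stream.
first = "Man: No you haven't!\n".encode("utf-8-sig")
second = "Other Man: Yes I have.\n".encode("utf-8-sig")
combined = first + second

# 'utf-8-sig' removes only the BOM at the very start of the stream,
# so the second file's BOM survives as U+FEFF at the start of line 2.
text = combined.decode("utf-8-sig")
lines = text.splitlines()

# Assumption: a leading U+FEFF on a line is a stray BOM, so drop it.
cleaned = [line.lstrip("\ufeff") for line in lines]

print([repr(line) for line in lines])
print(cleaned)
```

Using `lstrip` rather than `replace` leaves any U+FEFF in the middle of a line alone, in case it was meant as a zero-width no-break space.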

score 4 · Answer 2 · answered Mar 12 '17 at 09:50

I had a very similar problem when dealing with Excel CSV files. Initially I had saved my file from the drop-down choices as a .csv UTF-8 (comma delimited) file. Then I saved it as just a .csv (comma delimited) file and all was well. Perhaps there is a similar issue with your .txt file.

score 0 · Answer 3 · answered Apr 15 '20 at 17:24

When I had this happen, it only happened to the very first line of my CSV, both reading and writing. For what I was doing, I just made a "sacrificial" entry at the first location so that those characters would get added to my sacrificial entry and not any of the ones I cared about. Definitely not a robust solution, but it was quick and worked for my purposes.
