Confusion regarding UTF8 substring length

Question

Can someone please help me deal with byte-order mark (BOM) bytes versus UTF8 characters in the first line of an XHTML file?

Using Python 3.5, I opened the XHTML file as UTF8 text:

inputTopicFile = open(inputFileName, "rt", encoding="utf8")

A hex editor shows that the first line of that UTF8-encoded XHTML file begins with the three-byte UTF8 BOM, EF BB BF.

I wanted to remove the UTF8 BOM, which I supposed occupied the first three character positions (indices 0 through 2) in the string. So I tried this:

firstLine = firstLine[3:]

That didn't work: the characters <? were no longer present at the start of the resulting line.

So I did this experiment:

for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos]))

Which printed:

charPos 0 == 
charPos 1 == <
charPos 2 == ?

I then added .encode to that loop as follows:

for charPos in range(0, 3):
    print("charPos {0} == {1}".format(charPos, firstLine[charPos].encode('utf8')))

Which gave me:

charPos 0 == b'\xef\xbb\xbf'
charPos 1 == b'<'
charPos 2 == b'?'

Evidently Python 3 in some way "knows" that the 3-byte BOM is a single unit of non-character data? Meaning that one cannot process the first three 8-bit bytes(?) in the line as if they were UTF8 characters?

At this point I know that I can "trick" my code into giving me what I want by specifying firstLine = firstLine[1:]. But it seems wrong to do it that way(?)

So what's the correct way to discard the first three BOM bytes in a UTF8 string on the way to working with only the UTF8 characters?

EDIT: The solution, per the comment made by Anthony Sottile, turned out to be as simple as using encoding="utf-8-sig" when I opened the source XHTML file:

inputTopicFile = open(inputFileName, "rt", encoding="utf-8-sig")

That strips out the BOM. Voila!
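For anyone who wants to see the difference side by side, here is a minimal sketch. It assumes a throwaway temporary file stands in for the real XHTML file; sample_path, line_with_bom and line_without_bom are illustrative names, not part of my actual script.

```python
import os
import tempfile

# Write a small stand-in file that starts with the UTF-8 BOM (EF BB BF).
with tempfile.NamedTemporaryFile("wb", suffix=".xhtml", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf<?xml version=\"1.0\"?>\n")
    sample_path = tmp.name  # hypothetical path, for illustration only

# Plain "utf8" decodes the BOM into the character U+FEFF and keeps it.
with open(sample_path, "rt", encoding="utf8") as f:
    line_with_bom = f.readline()

# "utf-8-sig" consumes a leading BOM, if there is one, while decoding.
with open(sample_path, "rt", encoding="utf-8-sig") as f:
    line_without_bom = f.readline()

os.remove(sample_path)

print(repr(line_with_bom[:3]))     # '\ufeff<?'
print(repr(line_without_bom[:3]))  # '<?x'
```

With "utf-8-sig" the line starts at <? directly, so no slicing is needed afterwards.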

How are you actually opening the file? Python 3 doesn't give you bytes unless you ask for them. — Josh Lee, Jul 24 '17 at 21:23
https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python — Josh Lee, Jul 24 '17 at 21:24
@josh-lee I edited-in the file open method I used to the question. Also added a second loop that uses .encode to see what I'd get. — RBV, Jul 24 '17 at 21:42
Try using the `utf8-sig` encoding instead (it'll remove the byte-order-marker for you) — anthony sottile, Jul 24 '17 at 21:43
The BOM works in part because it's an encoding of a single Unicode character (U+FEFF, ZERO WIDTH NO-BREAK SPACE). This is why Python treats it as a single character: because it *is* a single character. — jwodder, Jul 24 '17 at 21:58
@AnthonySottile Yes, thanks for that! One can but suppose that `encoding="utf-8-sig"` came into being because others have had the same problem as I. Thx 'gain... — RBV, Jul 24 '17 at 21:59

score 1 · Answer 1 · answered Jul 24 '17 at 22:21

As you mentioned in your edit, you can open the file with the utf-8-sig encoding, but to answer your question of why it was behaving this way:

Python 3 distinguishes between byte strings (the ones with the b prefix) and character strings (without the b prefix), and prefers to use character strings whenever possible. A byte string works with the actual bytes; a character string works with Unicode codepoints. The BOM is a single codepoint, U+FEFF, so in a regular string Python 3 will treat it as a single character (because it is a single character). When you call encode, you turn the character string into a byte string.
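The distinction can be checked directly in an interpreter. This is only a small sketch; bom and first_line are made-up names standing in for the start of your file's first line.

```python
bom = "\ufeff"              # the BOM as a str: one code point
first_line = bom + "<?xml"  # stand-in for the decoded first line

print(len(bom))                       # 1
print(bom.encode("utf8"))             # b'\xef\xbb\xbf'
print(len(bom.encode("utf8")))        # 3

# Slicing a str counts code points; slicing bytes counts bytes.
print(first_line[1:])                 # <?xml
print(first_line.encode("utf8")[3:])  # b'<?xml'
```

So firstLine[3:] on the character string removed the BOM plus the two characters after it, which is why <? disappeared.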

Thus the results you were seeing are exactly what you should have: Python 3 does know what counts as a single character, which is all it sees until you call encode.
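If you ever end up with an already-decoded string that may or may not start with the BOM, slicing with [1:] is reasonable as long as you check for U+FEFF first. A minimal sketch, where strip_bom is a hypothetical helper name and not something from the standard library:

```python
def strip_bom(text):
    """Return text without a single leading U+FEFF, if one is present."""
    if text.startswith("\ufeff"):
        return text[1:]
    return text


print(strip_bom("\ufeff<?xml version=\"1.0\"?>"))  # <?xml version="1.0"?>
print(strip_bom("<?xml version=\"1.0\"?>"))        # <?xml version="1.0"?>
```

Checking before slicing means a file saved without a BOM does not lose its first real character.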
