I am using Python 3.4 on Windows 7 to download a set of text files, and when I read (and write back, after modifications) these files, a few byte order marks (BOMs) are retained in the text, primarily the UTF-8 BOM. Eventually I use each text file as a list (or a string), and I cannot seem to remove these BOMs. So my question is whether it is possible to remove the BOM.
For more context, the text files were downloaded from a public ftp source where users upload their own documents, so the original encodings are highly variable and unknown to me. To let the download run without error, I specified the encoding as UTF-8 (using latin-1 would give errors). So it is no mystery to me that I have the BOMs, and I don't think an up-front encoding/decoding solution is likely to be the answer for me (Convert UTF-8 with BOM to UTF-8 with no BOM in Python) - it actually appears to make other BOMs more frequent.
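To illustrate why the marks end up inside the text (with made-up bytes, not my actual download code): decoding a file that begins with the UTF-8 BOM using the plain utf-8 codec keeps the BOM as a \ufeff character at the start of the string.

# Made-up bytes standing in for a downloaded file that begins with the UTF-8 BOM
raw = b'\xef\xbb\xbfword'
text = raw.decode('utf-8')   # the plain utf-8 codec keeps the BOM in the text
print(repr(text))            # '\ufeffword'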
When I modify the files after download, I use the following syntax:
with open(t, "w", encoding='utf-8') as outfile:
    with open(f, "r", encoding='utf-8') as infile:
        text = infile.read()
        # code that makes the modifications and writes to outfile follows
Later on, after the "outfiles" are read in as a list, I see that some words carry the UTF-8 BOM, like \ufeff. I try to remove the BOM using the following list comprehension:
g = list_outfile #Outfiles now stored as list
g = [i.replace(r'\ufeff','') for i in g]
While this code will run, unfortunately the BOMs remain when, for example, I print the list (I believe I would have a similar issue even if I tried to remove the BOM from strings rather than lists: How to remove this special character?). If I put a normal (non-BOM) word in the list comprehension, that word is replaced.
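Here is a minimal reproduction of what I am seeing, with made-up data rather than my real file contents:

# Made-up data: the comprehension runs without error,
# but printing the list still shows the BOM
g = ['\ufeffword', 'plain']
g = [i.replace(r'\ufeff', '') for i in g]
print(g)   # ['\ufeffword', 'plain']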
I do understand that if I print the list object by object, the BOM does not appear (Special national characters won't .split() in Python), and the BOM is not in the raw text files. But I worry that those BOMs will remain when I run later text-analysis code, so any object that appears in the list as \ufeffword rather than word will be analyzed as \ufeffword.
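To show what I mean about printing, again with a made-up string: the BOM is zero-width, so I see nothing extra when I print a single item, but it shows up when the whole list is printed.

s = '\ufeffword'
print(s)     # looks like plain 'word' on screen (at least where I run it)
print([s])   # ['\ufeffword']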
Again, is it possible to remove the BOM after the fact?