I am using Python 3.4 on Windows 7 to download a set of text files, and when I read (and write back, after modifications) these files, a few byte order marks (BOMs) are retained in the text, primarily the UTF-8 BOM. Eventually I use each text file as a list (or a string), and I cannot seem to remove these BOMs. So my question is whether it is possible to remove the BOM.
For more context, the text files were downloaded from a public ftp source where users upload their own documents, so the original encodings are highly variable and unknown to me. To let the download run without error, I specified the encoding as UTF-8 (using latin-1 would give errors). So it is no mystery to me that I have the BOMs, and I don't think an up-front encoding/decoding solution is likely to be the answer for me (Convert UTF-8 with BOM to UTF-8 with no BOM in Python) - it actually appears to make other BOMs more frequent.
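To illustrate why the marks end up inside the text (with made-up bytes, not my actual download code): decoding a file that begins with the UTF-8 BOM using the plain utf-8 codec keeps the BOM as a \ufeff character at the start of the string.

# Made-up bytes standing in for a downloaded file that begins with the UTF-8 BOM
raw = b'\xef\xbb\xbfword'
text = raw.decode('utf-8')   # the plain utf-8 codec keeps the BOM in the text
print(repr(text))            # '\ufeffword'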
When I modify the files after download, I use the following syntax:
with open(t, "w", encoding='utf-8') as outfile:
    with open(f, "r", encoding='utf-8') as infile:
        text = infile.read()
        # code that makes the modifications and writes to outfile follows
Later on, after the "outfiles" are read in as a list, I see that some words carry the UTF-8 BOM, like \ufeff. I try to remove the BOM using the following list comprehension:
g = list_outfile #Outfiles now stored as list
g = [i.replace(r'\ufeff','') for i in g]
While this code will run, unfortunately the BOMs remain when, for example, I print the list (I believe I would have a similar issue even if I tried to remove the BOM from strings rather than lists: How to remove this special character?). If I put a normal (non-BOM) word in the list comprehension, that word is replaced.
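Here is a minimal reproduction of what I am seeing, with made-up data rather than my real file contents:

# Made-up data: the comprehension runs without error,
# but printing the list still shows the BOM
g = ['\ufeffword', 'plain']
g = [i.replace(r'\ufeff', '') for i in g]
print(g)   # ['\ufeffword', 'plain']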
I do understand that if I print the list object by object, the BOM does not appear (Special national characters won't .split() in Python), and the BOM is not in the raw text files. But I worry that those BOMs will remain when I run later text-analysis code, so any object that appears in the list as \ufeffword rather than word will be analyzed as \ufeffword.
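To show what I mean about printing, again with a made-up string: the BOM is zero-width, so I see nothing extra when I print a single item, but it shows up when the whole list is printed.

s = '\ufeffword'
print(s)     # looks like plain 'word' on screen (at least where I run it)
print([s])   # ['\ufeffword']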
Again, is it possible to remove the BOM after the fact?