
Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.

I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex will always fail on the first chapter marker if it appears on the first line and the regex includes the ^ caret to anchor the match at the beginning of the line, because the BOM is consumed as the first character. (Example regex: ^Chapter.)

What I've discovered is that if I omit the caret, the match doesn't fail on the first line, but then <feff> is included in the heading after I've processed it. An example:

<h1><feff>Chapter I</h1>
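The behavior can be reproduced in isolation (a minimal sketch, simulating the first line of a UTF-8 file whose BOM survives decoding as U+FEFF):

```python
import re

# The BOM decodes to U+FEFF and sits in front of the heading text.
first_line = "\ufeffChapter I"

# The anchored pattern fails: U+FEFF, not "C", is the first character.
print(re.search(r'^Chapter', first_line))         # no match

# The unanchored pattern matches but leaves the BOM in the line.
print(re.search(r'Chapter', first_line).group())  # matches "Chapter"
```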

The advice in this SO question (from which I learned of the BOM) is to fix your script so it does not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec, but they discuss errors I never encounter and do not show the syntax for opening a file with the desired decoder.

To be clear:

I generally use pipelines of the following format:

cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>

And I am opening the file with the following syntax:

import sys

fin = sys.stdin

if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')

for line in fin:
    ...Processing here...

My question is: what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it? (I am reading from stdin, so how would I accomplish this?)

The files are stored in UTF-8 encoding with DOS line endings (`\r\n`). I convert them to UNIX file format in vim with `set ff=unix` before processing (I have to do several manual pre-processing tasks before running the script).

mas
  • Hmm, `fin = sys.argv[sys.argv.index('-i') + 1]` should give you a filename in `fin`. It should then be opened with an `open` call that you have not shown, and that is the place where you could declare that you want to filter the BOM out. Could you please show your `open` instruction? – Serge Ballesta Jul 23 '18 at 14:04
  • @Serge I apologize. I typed it from memory and forgot to include the open. However, I mostly use `sys.stdin` because I've been using it in pipelines. I would especially like to know how to declare it with `sys.stdin`. – mas Jul 23 '18 at 15:21
  • Possible duplicate of [Convert UTF-8 with BOM to UTF-8 with no BOM in Python](https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python) – tripleee Jul 23 '18 at 15:29
  • Python 3 should transparently normalize the line endings with text files (Python 2 had `'Ur'` for opening a file for reading with line-ending normalization). The gist of the proposed duplicate is to use the `utf-8-sig` encoding when opening the file to transparently ignore the BOM, too. – tripleee Jul 23 '18 at 15:32
  • If you are preprocessing the files anyway, it might be the easiest to chop it off in that process. Check the first character and remove it if it is the "zero-width non-breaking space". – lenz Jul 23 '18 at 15:36
  • @tripleee: Unfortunately, when I process a DOS-ending line `re.search('^$')` fails to match blank lines. – mas Jul 23 '18 at 15:38
  • @lenz: How would I go about checking if it is the "zero-width non-breaking space"? – mas Jul 23 '18 at 15:38
  • See e.g. https://stackoverflow.com/questions/45240387/how-can-i-remove-the-bom-from-a-utf-8-file – tripleee Jul 23 '18 at 15:44
  • I think tripleee's link should help you; it really depends on the tool how to spell a specific Unicode character. – lenz Jul 23 '18 at 15:47
  • To note for future readers: tripleee's link is excellent. – mas Jul 23 '18 at 15:51

4 Answers


As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:

import sys
import codecs

# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')


if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')

for line in fin:
    ...Processing here...
Serge Ballesta
  • This actually seems to be the most elegant solution as it (seems to me) to be more portable than the other solutions. How will it handle non-utf-8 encoded scripts? Will it choke? – mas Jul 23 '18 at 16:17
  • My comment seems to be due to a misunderstanding of BOMs and character encoding. After perusing [this Unix.SE discussion](https://chat.stackexchange.com/rooms/62721/discussion-on-answer-by-stephane-chazelas-how-can-i-remove-the-bom-from-a-utf-8) as well as [this quora question](https://www.quora.com/How-common-is-UTF-16BE-or-even-UTF-32-for-text-files-e-g-XML-or-JSON-yes-I-JSON-only-allows-UTF-8) I have come to the conclusion that I will probably never need to worry about the BOM except to remove it and therefore take this answer as the final, most elegant and portable solution. – mas Jul 23 '18 at 17:08

In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to set the PYTHONIOENCODING environment variable before invoking your script, like:

PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>

Notice that this setting also makes stdout work with UTF-8-SIG, so your <outfile> will maintain the original encoding.

For your `-i` parameter, just do `open(path, 'rt', encoding="UTF-8-SIG")`
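Applied to the snippet from the question, that might look like the following (a sketch; `open_input` is a hypothetical helper name, not part of this answer):

```python
import sys

def open_input(argv):
    """Return the input stream: the file named after a -i option,
    opened with the BOM-aware utf-8-sig codec, or plain stdin."""
    if '-i' in argv:  # for command line option "-i <infile>"
        return open(argv[argv.index('-i') + 1], 'rt', encoding='UTF-8-SIG')
    return sys.stdin  # relies on PYTHONIOENCODING=UTF-8-SIG in the shell
```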

wiesion
  • Can I just export an environment variable or does your solution require `PYTHONIOENCODING="UTF-8-SIG"` to be declared while I'm running the script? – mas Jul 23 '18 at 15:41
  • Yes, declaring it as an environment variable should affect all Python scripts. – wiesion Jul 23 '18 at 15:46
  • If you set it earlier in the script, it will remain set for the duration of the script (unless you explicitly unset or change it, of course). You need to `export` it so it's visible to subprocesses such as Python. – tripleee Jul 23 '18 at 15:46
  • @tripleee: When you say "set it earlier in the script" is the script you are referring to my python script of my pipeline-command? I am curious because if I can just write this into my python script toward the top that might be the simplest solution. – mas Jul 23 '18 at 15:53
  • The shell script containing the pipeline. If the Python script is simple you can embed it in the shell script, but I would probably look into doing the preprocessing in Python too instead. – tripleee Jul 23 '18 at 15:55

You really don't need to import codecs or anything else to deal with this. As lenz suggested in the comments, just check for the BOM and throw it out.

for line in fin:  # fin is your open text stream (e.g. sys.stdin)
    if line and line[0] == "\ufeff":
        line = line[1:]  # trim the BOM away

    # the rest of your code goes here as usual
polm23

In Python 3.9 the default encoding for standard input seems to be utf-8, at least on Linux:

In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>

sys.stdin has the method reconfigure():

sys.stdin.reconfigure(encoding="utf-8-sig")

which should be called before any attempt to read standard input. This will decode the BOM, which will then no longer appear when reading sys.stdin.
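The same effect can be demonstrated without touching real stdin by wrapping an in-memory byte stream (a sketch; `io.BytesIO` stands in for `sys.stdin.buffer`):

```python
import io

# io.BytesIO stands in for the raw byte buffer behind sys.stdin.
raw = io.BytesIO(b'\xef\xbb\xbfChapter I\n')
stream = io.TextIOWrapper(raw, encoding='utf-8')  # plain utf-8 would leak the BOM

# reconfigure() must be called before any data has been read;
# note that its parameters are keyword-only.
stream.reconfigure(encoding='utf-8-sig')
print(repr(stream.readline()))  # 'Chapter I\n' -- the BOM is gone
```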