Python adds Unicode characters at the start of a file

Question

I use a script to update the version in each AssemblyVersion.cs file of a .NET project. It always worked perfectly, but since I reformatted my PC, it adds Unicode characters at the start of each .cs file it edits, like this:

ï»¿Ã¯Â»Â¿using System.Reflection;
using System.Runtime.InteropServices;
using System.Security;

I use this code to open a file:

with open(fname,  "r") as f:
    out_fname = fname + ".tmp"
    out = codecs.open(out_fname, "w", encoding='utf-8')
    textInFile=""
    for line in f:
        textInFile += (re.sub(pat, s_after,line))
    out.write(u'\uFEFF')
    out.write(textInFile)
    out.close()
os.remove(fname)
os.rename(out_fname, fname)

I've also tried, as written here, to use io instead of codecs, but nothing changed.

On other teammates' PCs it works with the same configuration (Win10 and IronPython 2.7).

What can I try to solve this issue? Where should I look for the problem?

Thanks

nempat · Answer 1 · 2016-10-18T10:35:44.987

It seems that the files on your file system use ISO-8859-1 encoding, while you are adding the UTF-8 BOM at the beginning of each file.

After your code does its job, you get a file with a UTF-8 BOM followed by ISO-8859-1 metadata at the beginning.
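
One way such a prefix can arise is sketched below. It assumes the original file already starts with a UTF-8 BOM, that the script reads those bytes as if they were ISO-8859-1 text, and that the editor displaying the result also uses ISO-8859-1; none of this is confirmed for the asker's machine. The variable names are only illustrative.

```python
import codecs

# The three bytes of a UTF-8 BOM, as they would sit at the start of the original file.
bom_bytes = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

# Assumption: the reader treats each byte as one ISO-8859-1 character,
# so the BOM becomes three ordinary characters instead of being recognised.
misread = bom_bytes.decode("iso-8859-1")

# The script then writes a fresh BOM followed by the misread text, encoded as UTF-8.
damaged = u"\ufeff".encode("utf-8") + misread.encode("utf-8")

# Assumption: the editor shows the damaged bytes as ISO-8859-1.
shown = damaged.decode("iso-8859-1")

# The escapes below spell out the nine characters in front of "using" in the question.
print(shown == u"\xef\xbb\xbf\xc3\xaf\xc2\xbb\xc2\xbf")
```

The first three characters are the newly written BOM, and the remaining six are the old BOM after being encoded a second time.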

I would check the encoding of your input files before modification with Notepad++ (or any other editor) just to see if the scenario I described is valid. If it is, you will need to read your input files with a different encoding in order to avoid the metadata:

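# The built-in open() takes a buffer size, not an encoding, as its third argument,
# so use codecs.open (requires "import codecs") to pass the encoding.
with codecs.open(fname, "r", "ISO-8859-1") as f:
    ...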

edited Oct 18 '16 at 10:35

answered Oct 18 '16 at 10:08

nempat

Sorry if I'm late. However, the encoding of the processed files is UTF-8 with BOM (specifically, these files are the AssemblyInfo.cs files of a .NET project). I've also tried adding "ISO-8859-1" as you indicated, in both the read and write calls, but it doesn't work. – Krusty Oct 27 '16 at 12:33
If the files you are processing are UTF-8 with BOM, then you should use the 'utf-8-sig' encoding, not the regular 'utf-8'. Maybe that is the issue: because you are reading the file as regular UTF-8, the existing BOM is read as text and ends up at the beginning of a file that already has the BOM you wrote manually. – nempat Oct 27 '16 at 12:52
I've tried it this way, but it still doesn't work: `with codecs.open(fname, "r") as f: out_fname = fname + ".tmp" out = codecs.open(out_fname, "w", encoding='utf-8-sig')` Any other suggestions? – Krusty Oct 27 '16 at 13:45
`with codecs.open(fname, "r", "utf-8-sig") as f:` I believe that the source file is the main cause, and that opening it with the correct encoding is necessary. – nempat Oct 27 '16 at 15:50
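
Putting the suggestions from these comments together, a minimal sketch could look like the following. It assumes the input files are UTF-8 (with or without a BOM) and that `pat` and `s_after` are the regex pattern and replacement from the question. The function name `rewrite_with_single_bom` is made up for illustration, and whether IronPython 2.7 handles 'utf-8-sig' the same way as CPython is an assumption that has not been checked here.

```python
import io
import os
import re


def rewrite_with_single_bom(fname, pat, s_after):
    """Apply re.sub to every line of fname and leave exactly one UTF-8 BOM."""
    out_fname = fname + ".tmp"
    # 'utf-8-sig' skips a leading BOM on read if one is present.
    # newline="" keeps the original line endings (e.g. CRLF) untouched.
    with io.open(fname, "r", encoding="utf-8-sig", newline="") as f:
        text = u"".join(re.sub(pat, s_after, line) for line in f)
    # 'utf-8-sig' writes one BOM on output, so no manual u'\uFEFF' is needed.
    with io.open(out_fname, "w", encoding="utf-8-sig", newline="") as out:
        out.write(text)
    os.remove(fname)
    os.rename(out_fname, fname)
```

The key point is that both the read and the write name the encoding explicitly, and the manual `out.write(u'\uFEFF')` is dropped so the BOM is written only once.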
