
The project I'm working on takes XML files and input streams and converts them to PDFs and text. In the unit tests I compare this generated text with a .txt file that contains the expected output.
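
For context, the comparison in the tests looks roughly like this (a minimal sketch; "input.xml", "expected.txt" and Converter.ToText are placeholder names, and Assert.AreEqual stands in for whichever test framework is in use):

string expected = File.ReadAllText("expected.txt");           // encoding of the expected file is guessed here
string actual = Converter.ToText(File.OpenRead("input.xml"));
Assert.AreEqual(expected, actual);                            // fails when umlauts come out mangled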

I'm now facing the issue that these .txt files are not encoded in UTF-8 and were written without persisting this information, so characters such as umlauts get mangled.

I have read a few articles on the topic of persisting and encoding .txt files, including correcting the encoding, saving and opening files in Visual Studio with a specific encoding, and some more.

I was wondering if there is a text file format that supports meta information about its encoding, as XML or HTML do, for example.

I'm looking for a solution that is:

  • Easily adoptable by any coworker on the same team
  • Persistent, not depending on me choosing an encoding in an editor
  • Does not require any additional exotic program
  • Can be read with no or only a little modification to C#'s File class and its input reading (see the sketch after this list)
  • Supports at least UTF-8 encoding
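
To illustrate the File-class point, this is roughly the amount of change I would consider acceptable (a sketch; "expected.txt" is a placeholder name):

// Pass an explicit encoding when reading the expected output.
string expected = File.ReadAllText("expected.txt", Encoding.UTF8);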
Peter
  • Why not just agree that all text files will be UTF-8? – Tom Blodget Jul 13 '18 at 01:36
  • I think this is more a configuration issue than an agreement. I see the possibility in it, yet I think a file format that supports such an annotation might be more helpful. – Peter Jul 13 '18 at 05:59

1 Answer


A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text. File.ReadAllText etc do this. A BOM should exist only at the beginning of files and streams.

A BOM is sometimes conflated with the encoding itself because both affect the file format, and a BOM applies only to Unicode encodings. In Visual Studio, with UTF-8, it's called "Unicode (UTF-8 with signature) - Codepage 65001".

Some C# code that demonstrates these concepts:

using System.Diagnostics;
using System.IO;
using System.Text;

var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true)); // emit a BOM, throw on invalid bytes
Debug.Assert(File.ReadAllBytes(path).Length == 7); // 3-byte BOM + 4 bytes of "Test"
Debug.Assert(File.ReadAllText(path).Length == 4);  // slightly mushy encoding detection strips the BOM

However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
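
For example, a file written as UTF-8 without a BOM and read back as Latin-1 silently produces mojibake; only the matching encoding round-trips (a small sketch to make the rule concrete):

var noBomPath = Path.GetTempFileName() + ".txt";
File.WriteAllText(noBomPath, "über", new UTF8Encoding(false));       // UTF-8, no BOM
var latin1 = Encoding.GetEncoding("ISO-8859-1");
Debug.Assert(File.ReadAllText(noBomPath, latin1) == "Ã¼ber");        // wrong encoding: mojibake
Debug.Assert(File.ReadAllText(noBomPath, Encoding.UTF8) == "über");  // same encoding as written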

Text editors almost universally adopt the principle that they should guess a file's character encoding first and, for the most part, allow users to correct the guess later. Some IDEs with project systems allow recording which encoding a file actually uses.
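
Code can make a similar, limited guess; for instance, StreamReader detects a BOM and reports the encoding it settled on (a minimal sketch; "expected.txt" is a placeholder):

using (var reader = new StreamReader("expected.txt", Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
{
    reader.Peek();                                      // force the BOM check
    Console.WriteLine(reader.CurrentEncoding.WebName);  // e.g. "utf-8"; without a BOM it falls back to the constructor encoding
}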

A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.

It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.

Tom Blodget