Here is code that writes a string to a file:

System.IO.File.WriteAllText("test.txt", "P                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ");

It's basically the character 'P' followed by 513 space characters.

When I open the file in Notepad++, it appears fine. However, when I open it in Windows Notepad, all I see is garbled characters.

If I use 512 or 514 space characters instead of 513, the file opens fine in Notepad.

What am I missing?

amyn
  • please use `"P" + new string(' ', 513)` – Tigran Aug 15 '18 at 19:28
  • Notepad is incorrectly detecting the file encoding. `File.WriteAllText(string,string)` uses UTF-8 by default. –  Aug 15 '18 at 19:28
  • See [the Notepad file-encoding problem](https://blogs.msdn.microsoft.com/oldnewthing/20070417-00/?p=27223). If you want to fix the problem, specify UTF-8 and start your file with a BOM. – Dour High Arch Aug 15 '18 at 19:30
  • @Tigran same result – amyn Aug 15 '18 at 19:30
  • @amyn His suggestion wasn't to fix your problem. It's to shorten your code, because 513 spaces in a string is nuts. –  Aug 15 '18 at 19:31
  • @amyn: it was for shortness of writing, and not a solution. – Tigran Aug 15 '18 at 19:31
  • @DourHighArch that's what I wanted. thanks – amyn Aug 15 '18 at 19:35
  • `File.WriteAllText(@"test.txt", ("P" + new string(' ', 513)), Encoding.UTF8);` – Jimi Aug 15 '18 at 19:40
  • @Jimi I already tried that and that does open up fine in Notepad. What I was curious about was the number 513. It doesn't happen with 512 or 514, only 513. – amyn Aug 15 '18 at 19:41
  • Have you ever wondered why Notepad is sooo slooow in opening some simple text files which take a fraction of that time to load anywhere else? Your text file begins with `50 20`. Now, is it Unicode (UTF16LE)? Is it the other Unicode (UTF8), the other Unicode (...)? Is it ANSI? Anxiety. Better recheck. No wait, let's read the WHOLE file and get the Byte Order Mark, then re-encode. Then I'll be sure... – Jimi Aug 15 '18 at 19:52
  • Notepad's file format detection routine works nearly all the time. But, not always. It probably has something to do with the fact that 512 is a nice round number (2 to the 9th). So, it sees a file that is 514 characters long and sees the last 512 of them as spaces, and figures the first two bytes must be some kind of badly formed byte order mark: (0x50, 0x20) and says "oh well, let me fall off the deep end". Report it as a bug to Microsoft (but don't hold your breath about getting it fixed). – Flydog57 Aug 15 '18 at 19:58
  • @Flydog57 Well, 103 should be enough to throw it off (can't test now, just a guess). – Jimi Aug 15 '18 at 20:06
  • Forgot to mention: check out what character is Unicode `U+2020`. – Jimi Aug 15 '18 at 20:14
  • See my answer below. There's nothing magic about 513, just that an odd number of spaces plus the original single character yields an even number of bytes. You're right that 512 and 514 don't duplicate your issue, but other combinations/lengths that result in an even number of _bytes_ reproduce it quite easily. – Dusty Aug 15 '18 at 20:40

2 Answers

What you are missing is that Notepad is guessing. The cause is not the specific count of 513 spaces: it is that the file has an even number of bytes and is at least 100 bytes in total. Try 511 or 515 spaces, or 99, and you'll see the same misinterpretation of your file contents. With an odd number of bytes, Notepad can rule out the double-byte encodings, because those use 2 bytes per character and so always produce an even total number of bytes. If you give the file a few more low-order ASCII characters at the beginning (e.g., "PICKLE" + spaces), Notepad does a much better job of recognizing that it should treat the content as single-byte characters.
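To make the garbling concrete, here is a minimal sketch of what a UTF-16LE misreading of the file's bytes produces. It assumes that Notepad's bad guess is UTF-16LE (the comments on the question point that way, but Notepad's internals are not confirmed here); `NotepadGuessDemo` and `MisreadAsUtf16` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text;

public static class NotepadGuessDemo
{
    // Decodes the UTF-8 bytes of `text` as if they were UTF-16LE.
    // Assumption: this mirrors the wrong guess Notepad makes.
    public static string MisreadAsUtf16(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text); // GetBytes adds no BOM
        return Encoding.Unicode.GetString(bytes);    // Encoding.Unicode is UTF-16LE
    }

    public static void Main()
    {
        string original = "P" + new string(' ', 513); // 514 bytes in UTF-8
        string garbled = MisreadAsUtf16(original);

        Console.WriteLine(garbled.Length);            // 257
        Console.WriteLine($"U+{(int)garbled[0]:X4}"); // U+2050 (bytes 50 20)
        Console.WriteLine($"U+{(int)garbled[1]:X4}"); // U+2020 (bytes 20 20)
    }
}
```

Every pair of bytes becomes one 16-bit character, so 514 bytes turn into 257 characters: one U+2050 followed by a run of U+2020.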

The suggested approach of passing Encoding.UTF8 is the easiest fix. It writes a BOM at the beginning of the file, which tells Notepad (and Notepad++) what the format of the data is, so neither has to resort to this guessing behavior. You can see the difference between your original approach and the BOM approach by opening both files in Notepad++ and looking at the bottom-right corner of the app: with the BOM it reports the encoding as UTF-8-BOM; without it, just UTF-8.
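As a rough sketch of that difference on disk, the following writes the same string both ways and checks for the three BOM bytes. The temp-file names and the `HasUtf8Bom` helper are made up for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomWriteDemo
{
    // Returns true when the bytes start with the UTF-8 BOM (EF BB BF).
    public static bool HasUtf8Bom(byte[] bytes) =>
        bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;

    public static void Main()
    {
        string text = "P" + new string(' ', 513);
        // Hypothetical file names, placed in the temp directory.
        string noBomPath = Path.Combine(Path.GetTempPath(), "test-no-bom.txt");
        string bomPath = Path.Combine(Path.GetTempPath(), "test-bom.txt");

        File.WriteAllText(noBomPath, text);              // default: UTF-8 without a BOM
        File.WriteAllText(bomPath, text, Encoding.UTF8); // UTF-8 with a BOM

        byte[] noBom = File.ReadAllBytes(noBomPath);
        byte[] withBom = File.ReadAllBytes(bomPath);

        Console.WriteLine($"{noBom.Length} bytes, BOM: {HasUtf8Bom(noBom)}");     // 514 bytes, BOM: False
        Console.WriteLine($"{withBom.Length} bytes, BOM: {HasUtf8Bom(withBom)}"); // 517 bytes, BOM: True
    }
}
```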

I should also say that the contents of your file are not 'wrong' per se; the garbled display is purely due to Notepad's "guessing" algorithm. So unless it's a requirement that people use Notepad to read your file with 1 letter and a large, odd number of spaces, maybe just don't sweat it. If you do change to writing the file with Encoding.UTF8, then you need to ensure that any other system that reads your file knows how to honor the BOM, because it is a real change to the contents of your file. If you cannot verify that all consumers of your file can and will handle the BOM, then it may be safer to accept that Notepad happens to make a bad guess for your specific use case, and leave the raw contents exactly how you want them.

You can verify the physical difference in your file with the BOM by reading the raw bytes and converting them to a string (you can't "see" the change with ReadAllText, because it honors and strips the BOM):

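// Requires: using System; using System.Text;
byte[] contents = System.IO.File.ReadAllBytes("test.txt");
// The BOM bytes (EF BB BF) are not ASCII, so each one prints as '?'.
Console.WriteLine(Encoding.ASCII.GetString(contents));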
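If the question marks are hard to spot, a hex dump of the first few bytes may be clearer. This is only a sketch: `HexPrefix` and the temp-file name are hypothetical, and it writes its own file with Encoding.UTF8 so that it runs on its own.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomReadDemo
{
    // Formats the first `count` bytes as hex, e.g. "EF-BB-BF-50".
    public static string HexPrefix(byte[] bytes, int count) =>
        BitConverter.ToString(bytes, 0, Math.Min(count, bytes.Length));

    public static void Main()
    {
        // Hypothetical file name, placed in the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "test-bom-read.txt");
        File.WriteAllText(path, "P" + new string(' ', 513), Encoding.UTF8);

        // The raw bytes include the BOM...
        Console.WriteLine(HexPrefix(File.ReadAllBytes(path), 4)); // EF-BB-BF-50

        // ...but ReadAllText detects and strips it.
        string text = File.ReadAllText(path);
        Console.WriteLine(text.Length); // 514
        Console.WriteLine(text[0]);     // P
    }
}
```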
Dusty
  • Yes, if you aren't telling a program which encoding a text file uses, it is guessing. (Unless there is some standard that applies such as JSON, XML, HTML, …, or other source of this essential information such as project file, etc.) – Tom Blodget Aug 15 '18 at 23:06

Try passing in a different encoding:

i. System.IO.File.WriteAllText(filename, stringVariable, Encoding.UTF8);
ii. System.IO.File.WriteAllText(filename, stringVariable, Encoding.UTF32);
iii. etc.

Also, you could build your string another way to make it easier to read, change, and count, instead of tapping the space bar 513 times:

i. Use the string constructor (like @Tigran suggested)

var result = "P" + new String(' ', 513);

ii. Use a StringBuilder

var stringBuilder = new StringBuilder();
stringBuilder.Append("P");
stringBuilder.Append(' ', 513); // appends 513 spaces in one call
var result = stringBuilder.ToString();

iii. Or both

public string AppendSpacesToString(string stringValue, int numberOfSpaces) 
{
    var stringBuilder = new StringBuilder();
    stringBuilder.Append(stringValue);
    stringBuilder.Append(new String(' ', numberOfSpaces));
    return stringBuilder.ToString();
}
Nathan Alard