File.ReadAllText with UTF-7 ignoring + characters

Question

I have a file on disk that was written by our program, with some data encoded as JSON.

I am using C#'s File.ReadAllText(string path, Encoding encoding) to read it later. For unrelated reasons, we have to work with UTF-7.

Our line then looks like this:

var content = File.ReadAllText(fileName, Encoding.UTF7);

It works fine, writing then reading, for basically everything we need. The only exception is the plus sign (+). If there is a + sign in our file, this code returns the entire string with every + removed. So

{ "commandValue": "testvalue + otherValue" }

turns into

{ "commandValue": "testvalue  otherValue" }

I have checked the file bytes, and the + sign is indeed byte 0x2B, which is the right character in UTF-7 (and also the same byte in UTF-8, not sure if it matters).

I can't figure out why they disappear when reading it.

For the sake of tests, I have tried reading it with

var content = File.ReadAllText(fileName, Encoding.UTF8);

and it worked fine. The chars did not disappear.

What could I possibly be doing wrong, and how could I make File.ReadAllText(fileName, Encoding.UTF7) not ignore those characters?

As of now, I haven't found another char that has this problem, but I obviously did not test all of them.

Are you sure the file has been saved on utf and not in unicode? — Gusman, Aug 01 '17 at 19:49
'+' is a special character in UTF7 used to denote an escape sequence. To @Gusman's point, the string probably wasn't written using UTF7 encoding. When you're reading it as UTF7, that '+' is being viewed as the start of an escape sequence, but then no valid sequence is encountered, so the UTF7 decoder just 'eats' the '+'. If you were to put a '-' after each of the pluses in your file, your UTF7 decoding will work properly (i.e. all "+" becomes "+-")... at least for the pluses. The main problem, though, is that the string wasn't written to the file using a UTF7 encoder. — wablab, Aug 01 '17 at 19:59
@Gusman, I know you know. :) My comment was for the OP, and I was trying to give recognition to the fact that my comment was really just a fleshing-out of your (very valid) point. — wablab, Aug 01 '17 at 20:02
You're right guys, and using File.WriteAllText with and without UTF7 corroborates your claims. — Bent Tranberg, Aug 01 '17 at 20:04
@wablab I didn't make the section of the code that creates this file (or the section that reads it either, I was just assigned to the bug where certain sequences were changing between user input and final result). If you made that into an answer I'd certainly accept it :) — Kaito Kid, Aug 02 '17 at 11:26

score 5 · Accepted Answer · answered Aug 02 '17 at 14:22


The file is not being written using UTF7. The '+' is a special character in the UTF7 encoding scheme used to denote the start of a "modified base64" sequence; a literal plus sign has to be written as "+-". So, when the file is read as UTF7, the decoder sees the bare '+', expects a modified base64 sequence (but finds none), and then continues decoding the rest of the file as usual. The '+' is dropped from the output as a result.
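To see the effect in isolation, here is a minimal sketch that decodes the string from the question directly from bytes, without going through a file. `DecodeAsUtf7` is a hypothetical helper introduced only for this illustration, and the `#pragma` is there because `Encoding.UTF7` is marked obsolete (SYSLIB0001) on .NET 5 and later.

```csharp
using System;
using System.Text;

#pragma warning disable SYSLIB0001 // Encoding.UTF7 is flagged obsolete on .NET 5 and later

// Hypothetical helper: treats the text as raw ASCII bytes and decodes them as UTF-7.
static string DecodeAsUtf7(string rawAscii) =>
    Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(rawAscii));

// A bare '+' followed by a non-base64 character: the decoder drops the '+'.
string bare = DecodeAsUtf7("testvalue + otherValue");

// "+-" is the UTF-7 spelling of a literal plus sign, so it survives decoding.
string escaped = DecodeAsUtf7("testvalue +- otherValue");

Console.WriteLine(bare);    // testvalue  otherValue
Console.WriteLine(escaped); // testvalue + otherValue
```

The first result matches the behaviour described in the question; the second shows what a correctly UTF-7-encoded plus sign looks like to the decoder.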

To fix the issue you're seeing, you could try reading the file as UTF8, or you could update the code that writes the file so that it uses UTF7 encoding, matching the code that reads it.
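As a rough sketch of the second option, assuming the writing code can be changed to pass `Encoding.UTF7` to `File.WriteAllText`, the round trip below keeps the plus sign. The temporary file path and the variable names are stand-ins for this example, not the original program's code.

```csharp
using System;
using System.IO;
using System.Text;

#pragma warning disable SYSLIB0001 // Encoding.UTF7 is flagged obsolete on .NET 5 and later

// Stand-in for the real file; the actual path in the program is not known.
string path = Path.GetTempFileName();
string json = "{ \"commandValue\": \"testvalue + otherValue\" }";

// Write with the same encoding that will be used for reading.
File.WriteAllText(path, json, Encoding.UTF7);

// Peek at the raw file content: the literal '+' has been written as "+-".
string onDisk = File.ReadAllText(path, Encoding.ASCII);

// Reading it back as UTF-7 now restores the original text.
string roundTripped = File.ReadAllText(path, Encoding.UTF7);
File.Delete(path);

Console.WriteLine(onDisk.Contains("+-")); // True
Console.WriteLine(roundTripped == json);  // True
```

If the file is only ever produced by code that does not use UTF-7, reading it with `Encoding.UTF8` (as tried in the question) is the simpler change.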


wablab


I suggest adding the bit about the "+-" sequence from your comment to this answer. Considering the fact that everyone that had worked on the write and read parts is currently on vacation, I can't really figure out why they chose those encodings for read and write, and won't change it without knowing the full scope of the effects. Temporarily, I have used the "+-" method as a placeholder for this user until I can get the input required. – Kaito Kid Aug 02 '17 at 17:55
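The "+-" placeholder mentioned in the comment above could look roughly like the sketch below. `EscapePlusForUtf7` is a hypothetical helper, and it assumes the text never contains genuine UTF-7 shift sequences or an already-escaped "+-", since a blanket replace would corrupt those.

```csharp
using System;
using System.Text;

#pragma warning disable SYSLIB0001 // Encoding.UTF7 is flagged obsolete on .NET 5 and later

// Hypothetical helper: rewrites every '+' as "+-" so a UTF-7 decoder yields a literal plus.
// Assumption: the input contains no real UTF-7 escape sequences.
static string EscapePlusForUtf7(string text) => text.Replace("+", "+-");

string userInput = "testvalue + otherValue";
string stored = EscapePlusForUtf7(userInput);

// Decode the stored bytes the way the reading side does.
string readBack = Encoding.UTF7.GetString(Encoding.ASCII.GetBytes(stored));

Console.WriteLine(stored);   // testvalue +- otherValue
Console.WriteLine(readBack); // testvalue + otherValue
```

This only papers over the mismatch for plus signs; the underlying issue remains that the writer and the reader use different encodings.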
