How to remove BOM from an encoded base64 UTF string?

Question

I have a file that was Base64-encoded on macOS from the command line with `openssl base64 -in en -out en1`, and I am reading it with the following code:

string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);

The string I am getting has a ? before the actual file content. I am not sure what's causing this; any help would be appreciated.

Example Input:

import pandas
import json

Encoded file example:

77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K

Output based on the C# code:

?import pandas
import json

@gunr2171 done, let me know if that doesn't satisfy what you want. — qwerty, May 12 '22 at 22:37
What's an example of `fileContent`, which should be a base64 encoded value? — gunr2171, May 12 '22 at 22:40
Are you sure you should be using UTF8 for the encoding, not ASCII? https://wiki.openssl.org/index.php/Base64 — gunr2171, May 12 '22 at 22:41
77u is the byte order marker, see https://stackoverflow.com/q/59882396/7329832 — jps, May 12 '22 at 22:47
@gunr2171 actually, no: using `System.Text.Encoding.ASCII.GetString(b1);` adds `???` instead of `?` — qwerty, May 12 '22 at 22:47
@jps shouldn't `Encoding.UTF8` take care of that, according to that post? Am I missing something? — qwerty, May 12 '22 at 23:32
Documentation says you should trim the BOM using `TrimStart`, see https://learn.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.getstring?view=net-6.0 — Charlieface, May 12 '22 at 23:56
@qwerty when I decode the string to UTF-8 on base64decode.org or with Node.js there is no '?' in the output. The BOM is an invisible zero-width character. I have no idea why it is different in C#. — jps, May 13 '22 at 21:13
@Charlieface TrimStart expects a particular string to Trim, correct? Will the encoding always have that specific string I have mentioned in my question? — qwerty, May 16 '22 at 23:31
Of course it could be different. C# is *documented* to decode the BOM into the string, other decoders do not. For example, SQL Server also shows the BOM https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=e33a7334ad7e09a15f5a70934f9c0104 `TrimStart` without any parameters removes all whitespace, and removes the BOM. It's reasonably efficient if it's not there (it will not create a new string unnecessarily). I don't know what your data will have, I don't have X-ray vision. I suggest you check what you have. — Charlieface, May 17 '22 at 00:29
@Charlieface Much appreciated! Yeah, for some reason C# isn't able to. What I meant by the specific string was that will BOM always be a `77u/`. Didn't expect you to foresee what my string data will look like. If you would like to post that as your answer, I'll gladly accept it. — qwerty, May 17 '22 at 00:42
Does this answer your question? [What's the difference between UTF-8 and UTF-8 without BOM?](https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom) This should give you a button to click above to accept — Charlieface, May 17 '22 at 00:43
@Charlieface it answers my question about how the BOMs could be different. Unfortunately, the question of how to decode a UTF-8 string in C# with BOM still persists. — qwerty, May 18 '22 at 06:33
See https://stackoverflow.com/questions/1003275/how-to-convert-utf-8-byte-to-string. `Encoding.UTF8.GetString` and then strip the BOM. Or you can just check the first three bytes and skip them. Also if you use `Encoding.UTF8` as opposed to `Encoding.ASCII` it at least converts the BOM correctly into `0xFEFF`. Incidentally when I tested it `TrimStart` did not remove it unless you use `TrimStart((char)0xFEFF)` — Charlieface, May 18 '22 at 08:29

score 1 · Accepted Answer · answered May 18 '22 at 07:25

Normally, when you read UTF-8 text (with a BOM) from a file, the BOM is handled for you behind the scenes. For example, both of the following lines read UTF-8 text correctly whether or not the file has a BOM:

File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.

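The problem is that you're dealing with UTF-8 text that has been encoded as a Base64 string, so ReadAllText() never sees the BOM bytes and can no longer handle them for you. After Convert.FromBase64String(), the BOM is still at the start of the byte array, and Encoding.UTF8.GetString() decodes it into the string as U+FEFF instead of removing it. You can either handle it yourself by checking for and removing the first 3 bytes of the byte array, or delegate that job to a StreamReader, which is what ReadAllText() uses internally: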

var bytes = Convert.FromBase64String(fileContent);
string finalString = null;

using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms))  // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
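
The other option mentioned above, checking for the BOM and skipping the first 3 bytes yourself, could look roughly like the sketch below. It assumes the decoded payload is always UTF-8; the class name `Base64Text` and the method `DecodeUtf8WithoutBom` are made up for illustration and are not part of .NET.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Base64Text
{
    // Hypothetical helper (not a .NET API): decodes a Base64 string to UTF-8
    // text and skips a leading UTF-8 BOM (EF BB BF) if one is present.
    public static string DecodeUtf8WithoutBom(string base64)
    {
        byte[] bytes = Convert.FromBase64String(base64);
        byte[] bom = Encoding.UTF8.GetPreamble(); // the 3 bytes EF BB BF

        // Only skip the BOM when the array really starts with it.
        bool hasBom = bytes.Length >= bom.Length
            && bytes.Take(bom.Length).SequenceEqual(bom);
        int offset = hasBom ? bom.Length : 0;

        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Calling `Base64Text.DecodeUtf8WithoutBom(fileContent)` should then return the text without the leading U+FEFF, and input that has no BOM passes through unchanged. If you would rather keep the original `GetString` call, the comments above report that `TrimStart((char)0xFEFF)` on the decoded string also removes it.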
