Encoding, decoding and re-encoding not producing original results

Question

I suspect that my decoding is not working properly, so I am testing it by encoding, decoding and then re-encoding to see whether I get the same result. However, that is not the case.

I encoded a byte[] named model.PDF to a base64 string.

Now, for decoding, I converted model.PDF to a decoded base64 string. However, the output looks faulty or corrupted when I inspect it in the debugger, and I suspect this is where something is going wrong.

To encode again, the decoded data is turned into byte[] again and then into an encoded base64 string. However base64EncodedData does not match plainTextEncodedData. Please help me create a flawless encode to decode to re-encode flow.

// ENCODING - Byte array -> base64 encoded string
string base64EncodedData = Convert.ToBase64String(model.PDF);

// DECODING - Byte array -> base64 decoded string
var base64DecodedData = Encoding.UTF8.GetString(model.PDF);

// ENCODING AGAIN
byte[] plainTextBytes = Encoding.UTF8.GetBytes(base64DecodedData);
var plainTextEncodedData = Convert.ToBase64String(plainTextBytes);

To elaborate, the re-encoding matches the initial encoding perfectly if executed like this.

var PDF = System.Text.Encoding.UTF8.GetBytes("redgreenblue");

string base64EncodedData  = Convert.ToBase64String(PDF);

// DECODING - Byte array -> base64 decoded string
var base64DecodedData = Encoding.UTF8.GetString(PDF);

// ...

But, my model.PDF is fetched from the database as shown below, in which case the re-encoding does not match.

while (reader.Read()) {
    model.PDF = reader["PDF"] == DBNull.Value ? null : (byte[])reader["PDF"];
}

On an online base64 decoder (https://www.base64decode.org/), decoding an example value of base64EncodedData shows the ideal and correct value.

%PDF-1.5
%
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-IN) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 4 0 R] >>
endobj
3 0 obj
<</Author(admin) /CreationDate(D:20190724114817+05'30') 
/ModDate(D:20190724114817+05'30') /Producer(Microsoft Excel 2013) /Creator(Microsoft Excel 2013) >>
endobj
4 0 obj
<</Type/Page/Parent 2 0 R/Resources<</Font<</F1 6 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 5 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 171>>
stream

...

However, in my program, the value of base64DecodedData shows up in its entirety as:

%PDF-1.5
%����
1 0 obj
<</Type/Catalog/Pages 2 0 R/Lang(en-IN) /StructTreeRoot 8 0 R/MarkInfo<</Marked true>>>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[ 4 0 R] >>
endobj
3 0 obj
<</Author(admin) /CreationDate(D:20190724114817+05'30') 
/ModDate(D:20190724114817+05'30') /Producer(��

The two look similar in some ways, but my program seems to be producing a corrupt version of what the actual base64 decoded string should be.

Encoding.UTF8 will not create base64 strings. It interprets the bytes as UTF-8 encoded text. You should check each variable in the debugger. — Ray, Jul 27 '19 at 07:36
`Convert.FromBase64String(base64EncodedData)` This just reverses the previous step. What's the point of doing that? And the variable name `base64EncodedBytes` --> Umm nope, nothing is encoded here; you're just back where you started. — 41686d6564 stands w. Palestine, Jul 27 '19 at 07:39
`FromBase64String` decodes, `ToBase64String` encodes. That's it. Why are you doing all this `Encoding.UTF8.GetString` and `Encoding.UTF8.GetBytes` stuff? — Sweeper, Jul 27 '19 at 07:53
@AhmedAbdelhameed I changed the code and simplified it based on your comment. — Pezanne Khambatta, Jul 27 '19 at 07:56
@PezanneKhambatta I think your code now works, doesn't it? `base64EncodedData ` and `plainTextEncodedData` should be the same. — Sweeper, Jul 27 '19 at 08:07
Your problem has nothing to do with Base64. The root cause of the problem is that `Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(someByteArray));` is _**not**_ guaranteed to return a byte array that is equivalent to the original `someByteArray`. Check [this question](https://stackoverflow.com/q/9740553/4934172) for more. Just use `Convert.ToBase64String` and `Convert.FromBase64String`. Are these two not enough? — 41686d6564 stands w. Palestine, Jul 27 '19 at 08:37
Possible duplicate of [Why isn't \`Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))==x\`](https://stackoverflow.com/questions/9740553/why-isnt-encoding-utf8-getbytesencoding-utf8-getstringx-x) — 41686d6564 stands w. Palestine, Jul 27 '19 at 08:39
@AhmedAbdelhameed From cursory reading, that does not explain `faulty or corrupted` output as the OP stated. That shouldn't happen from encoding-decoding. It comes down to: `model.PDF` probably isn't correct UTF8. — KekuSemau, Jul 27 '19 at 09:00
@KekuSemau PDF [can have different encodings](https://stackoverflow.com/a/10656899/4934172). Moreover, even if it was valid UTF8 (which is unlikely), the `UTF8.GetBytes` method is [still not guaranteed](https://stackoverflow.com/a/9740590/4934172) to produce the same _original bytes_ as I stated above. — 41686d6564 stands w. Palestine, Jul 27 '19 at 09:21
Since when do you use `Encoding.Something.GetString()` on the bytes of a binary file? Unless you really want to destroy it. The bytes of a binary file do not represent a string in any way. What code points should these values be converted to? — Jimi, Jul 27 '19 at 11:33
I would now mark this as a duplicate of https://superuser.com/questions/1445520/what-does-%C3%B6%C3%A4%C3%BC%C3%9F-in-the-2nd-line-of-pdf-files-mean, but I can't because it's not on SO. — KekuSemau, Jul 30 '19 at 05:57
Maybe the way you are reading the `model.pdf` file corrupts the encoding. Which method are you using to read the `model.pdf` file? — Mustafa Salih ASLIM, Jul 30 '19 at 13:09

score 1 · Answer 1 · answered Jul 30 '19 at 05:58

A PDF is an ASCII-based file that can contain binary data (including strings in other encodings), so you cannot reliably read it as plain text.

If a PDF file contains binary data, as most do [...] the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater.

Taken from this answer, which has some more information.

You see exactly these four characters in your own output.
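
As a minimal sketch of the point made in the comments above, the snippet below contrasts the lossy UTF-8 text round trip with the lossless Base64 round trip. The `pdf` array is only an illustrative stand-in for `model.PDF` (a PDF header followed by four bytes of 128 or greater), not the actual database content, and the class name is made up for the example.

```csharp
using System;
using System.Linq;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        // Illustrative stand-in for model.PDF: "%PDF-1.5\n%" followed by
        // four bytes >= 128, like the binary comment line of a PDF header.
        byte[] pdf = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x35, 0x0A,
                       0x25, 0xB5, 0xB5, 0xB5, 0xB5, 0x0A };

        // Lossy: bytes that are not valid UTF-8 are replaced with U+FFFD,
        // so converting the string back yields different bytes.
        string asText = Encoding.UTF8.GetString(pdf);
        byte[] viaUtf8 = Encoding.UTF8.GetBytes(asText);
        Console.WriteLine(pdf.SequenceEqual(viaUtf8));   // False

        // Lossless: Base64 encode, decode, and re-encode the raw bytes.
        string base64EncodedData = Convert.ToBase64String(pdf);
        byte[] decodedBytes = Convert.FromBase64String(base64EncodedData);
        string reEncodedData = Convert.ToBase64String(decodedBytes);

        Console.WriteLine(pdf.SequenceEqual(decodedBytes));       // True
        Console.WriteLine(base64EncodedData == reEncodedData);    // True
    }
}
```

The decoded form of a Base64 string is a `byte[]`, not a `string`, so the comparison is done on byte arrays; text encodings are not involved at any step.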

answered Jul 30 '19 at 05:58 by KekuSemau

Fair enough. Could you please explain what you think the online converter is doing different from my program? – Pezanne Khambatta Jul 30 '19 at 12:57

It seems that the online tool ignores some non-printable bytes. – KekuSemau Jul 30 '19 at 13:31
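
What the online tool does internally is not known here. Purely as a debugging aid, the sketch below shows one way to get a similarly readable preview of binary data without converting it through UTF-8: printable ASCII and line breaks are kept, and every other byte is shown as a dot. `BytePreview` and `ToPrintableAscii` are hypothetical names, and the result is for display only; it cannot be turned back into the original bytes.

```csharp
using System;
using System.Text;

static class BytePreview
{
    // Hypothetical helper: keeps printable ASCII and line breaks,
    // and renders every other byte as '.' (display only, not reversible).
    public static string ToPrintableAscii(byte[] data)
    {
        var sb = new StringBuilder(data.Length);
        foreach (byte b in data)
        {
            bool printable = b >= 0x20 && b <= 0x7E;
            bool lineBreak = b == 0x0A || b == 0x0D;
            sb.Append(printable || lineBreak ? (char)b : '.');
        }
        return sb.ToString();
    }
}

class Program
{
    static void Main()
    {
        // Illustrative bytes: '%' followed by four non-ASCII bytes.
        byte[] sample = { 0x25, 0xB5, 0xB5, 0xB5, 0xB5 };
        Console.WriteLine(BytePreview.ToPrintableAscii(sample)); // %....
    }
}
```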
