
With the base64-encoded string JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN I am getting different results from Emacs than from the Clojure code below.

Can anyone explain to me why?

The elisp below gives the correct output, ultimately giving me a valid PDF document (when I paste the entire string). I am sure my Emacs buffer is set to UTF-8:

(base64-decode-string "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")

"%PDF-1.1
 %âãÏÓ
 1 0 obj
 << 

Here is the same output with the character codes shown as octal escapes:

  "%PDF-1.1
  %\342\343\317\323
  1

The Clojure below gives incorrect output, rendering the PDF document invalid when I give it the entire string:

(import 'java.util.Base64 )

(defn decode  [to-decode]
  (let [
        byts           (.getBytes to-decode "UTF-8")
        decoded        (.decode (java.util.Base64/getDecoder) byts)
        ]
    (String. decoded "UTF-8")))


(decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")

"%PDF-1.1
%����
1 0 obj
<< 

Same output, with the character codes as octal escapes. I couldn't even copy/paste this; I had to type it in. This is what the first three lines look like when I open the decoded PDF in text-mode:

 "%PDF-1.1
  %\357\277\275\357\277\275\357\277\275\357\277\275
  1"

Edit: Taking Emacs out of the equation:

If I write the encoded string to a file called encoded.txt and pipe it through the Linux program base64 --decode, I get valid output and a good PDF as well. Here is the Clojure:

(defn decode  [to-decode]
  (let [byts        (.getBytes to-decode "ASCII")
        decoded     (.decode (java.util.Base64/getDecoder) byts)
        flip-negatives  #(if (neg? %) (char (+ 255 %)) (char %))
        ]
    (String. (char-array (map flip-negatives decoded)) )))

(spit "./output/decoded.pdf" (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))

(spit "./output/encoded.txt" "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")

Then this at the shell:

➜  output git:(master) ✗ cat encoded.txt| base64 --decode > decoded2.pdf 
➜  output git:(master) ✗ diff decoded.pdf decoded2.pdf 
2c2
< %áâÎÒ
---
> %����
➜  output git:(master) ✗

Update - this seems to work

Alan Thompson's answer below put me on the right track, but geez, what a pain to get there. Here's the idea of what works:

(def iso-latin-1-charset (java.nio.charset.Charset/forName "ISO-8859-1" ))

(as-> some-giant-string-i-hate-at-this-point $
  (.getBytes $)
  (String. $   iso-latin-1-charset)
  (base64/decode $ "ISO-8859-1")
  (spit "./output/a-pdf-that-actually-works.pdf" $ :encoding "ISO-8859-1" ))
joefromct
  • What is the full expected output? Can you also paste a (short) example with the integer value of each desired character? – Alan Thompson Mar 14 '18 at 20:21
  • Your data contains bytes that cannot be decoded as valid UTF-8. It’s invalid UTF-8. The `String` constructor replaces the invalid bytes with the Unicode replacement character �. – glts Mar 14 '18 at 21:28
  • Please see **Update #2** to my answer below. The problem is that the original text was ISO-8859-1, not UTF-8. – Alan Thompson Mar 14 '18 at 22:11

3 Answers


Returning the results as a string, I get:

(b64/decode-str "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")  
  => "%PDF-1.1\r\n%����\r\n1 0 obj\r\n<< \r"

and as a vector of ints:

(mapv int (b64/decode-str "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")) 

  => [37 80 68 70 45 49 46 49 13 10 37 65533 65533 65533 65533 13 10 49 32 48 
      32 111 98 106 13 10 60 60 32 13]

Since both the beginning and end of the string look OK, I suspect the B64 string might be malformed?


Update

I went to http://www.base64decode.org and got the result

"Malformed input... :("



Update #2

The root of the problem is that the source characters are not UTF-8 encoded. Rather, they are ISO-8859-1 (aka ISO-LATIN-1) encoded. See this code:

  (defn decode-bytes
    "Decodes a byte array from base64, returning a new byte array."
    [code-bytes]
    (.decode (java.util.Base64/getDecoder) code-bytes))

  (def iso-latin-1-charset (java.nio.charset.Charset/forName "ISO-8859-1" )) ; aka ISO-LATIN-1

  (let [b64-str         "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"
        bytes-default   (vec (.getBytes b64-str))
        bytes-8859      (vec (.getBytes b64-str iso-latin-1-charset))

        src-byte-array  (decode-bytes (byte-array bytes-default))
        src-bytes       (vec src-byte-array)
        src-str-8859    (String. src-byte-array iso-latin-1-charset)
        ]...  ))

with result:

iso-latin-1-charset => <#sun.nio.cs.ISO_8859_1 #object[sun.nio.cs.ISO_8859_1 0x3edbd6e8 "ISO-8859-1"]>

bytes-default  => [74 86 66 69 82 105 48 120 76 106 69 78 67 105 88 105 52 56 47 84 68 81 111 120 73 68 65 103 98 50 74 113 68 81 111 56 80 67 65 78]
bytes-8859     => [74 86 66 69 82 105 48 120 76 106 69 78 67 105 88 105 52 56 47 84 68 81 111 120 73 68 65 103 98 50 74 113 68 81 111 56 80 67 65 78]

(= bytes-default bytes-8859) => true

src-bytes      => [37 80 68 70 45 49 46 49 13 10 37 -30 -29 -49 -45 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13]
src-str-8859   => "%PDF-1.1\r\n%âãÏÓ\r\n1 0 obj\r\n<< \r"

So the java.lang.String constructor will work correctly with a byte[] input, even when the high bit is set (making them look like "negative" values), as long as you tell the constructor the correct java.nio.charset.Charset to use for interpreting the values.
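
For example, the same decoded bytes produce different strings depending only on which Charset you hand the constructor (a minimal sketch; the byte values are the start of src-bytes above):

(let [bs (byte-array [37 -30 -29 -49 -45])]
  [(String. bs iso-latin-1-charset)                        ; => "%âãÏÓ"
   (String. bs java.nio.charset.StandardCharsets/UTF_8)])  ; => "%����" (replacement characters)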

Interesting that the object type is sun.nio.cs.ISO_8859_1.


Update #3

See the SO question below for a list of libraries that can (usually) autodetect the encoding of a byte stream (e.g. UTF-8, ISO-8859-1, ...)

What is the most accurate encoding detector?
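
For example, with the icu4j library on the classpath (an assumed extra dependency, not something used elsewhere in this question), detection looks roughly like this:

(import 'com.ibm.icu.text.CharsetDetector)

;; Guess the charset of a byte array and return the best match's name,
;; e.g. "ISO-8859-1" or "UTF-8". Detection is heuristic, so treat the
;; result as a hint rather than a guarantee.
(defn guess-charset [^bytes bs]
  (.getName (.detect (doto (CharsetDetector.)
                       (.setText bs)))))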

Alan Thompson

I think you need to verify the actual bytes that are produced in both scenarios. I would save both decoded results to files and then compare them using, for example, the xxd command-line tool to get a hex display of the bytes in each file.
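
If you would rather stay in Clojure than shell out to xxd, a sketch along these lines (using the file names from the question) reports the first byte offset at which the two files differ:

(require '[clojure.java.io :as io])
(import 'java.nio.file.Files)

;; Compare two files byte by byte; returns the first differing offset,
;; or nil when the common prefix is identical (lengths are not checked).
(defn first-diff [file-a file-b]
  (let [read-bytes #(Files/readAllBytes (.toPath (io/file %)))
        ba (read-bytes file-a)
        bb (read-bytes file-b)]
    (->> (map vector (range) ba bb)
         (some (fn [[i x y]] (when (not= x y) i))))))

(first-diff "./output/decoded.pdf" "./output/decoded2.pdf")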

I suspect your Emacs and your Clojure application use different fonts, causing the same non-ASCII bytes to be rendered differently, e.g. the same byte value is rendered as â in Emacs and as � in the Clojure output.

I would also check whether elisp really creates the resulting string using UTF-8. The documentation for base64-decode-string mentions unibyte strings, and I am not sure that is really UTF-8. Unibyte sounds like an encoding that always uses one byte per character, whereas UTF-8 uses one to four bytes per character.

Piotrek Bzdyl
  • Yes... I think I want unibytes. I need to figure out how to get Clojure to make a unibyte too. I updated with some other output too, without the special chars/font differences. – joefromct Mar 14 '18 at 21:11
  • I guess you can try using a different encoding in Clojure, e.g. ascii or some iso encoding. – Piotrek Bzdyl Mar 14 '18 at 21:12
  • And if you just want to get your decoded PDF in the file or sent over the wire (e.g. in HTTP response) why bother converting to string when you can write bytes from `Base64.Decoder.decode()` directly? – Piotrek Bzdyl Mar 14 '18 at 22:04
  • So what do you need to do with this data? It's rather unusual to read PDF file as a string. – Piotrek Bzdyl Mar 14 '18 at 22:14
  • I need to read the string, decode it, and write my new, hopefully valid, PDF. It seems to work from the command line with `base64 --decode`... ? – joefromct Mar 14 '18 at 22:26

Update

@glts made a correct point in his comment to the question. If we go to http://www.utilities-online.info/base64/ (for example), and we try to decode the original string, we get a third, different result:

%PDF-1.1
%⣏Ӎ
1 0 obj
<< 

However, if we try to encode the data the OP posted, we get a different Base64 string, JVBERi0xLjEKICXDosOjw4/DkwogMSAwIG9iagogPDwg, which, if we run it through the original decode implementation as written by the OP, gives the expected output:

(decode "JVBERi0xLjEKICXDosOjw4/DkwogMSAwIG9iagogPDwg")
"%PDF-1.1\n %âãÏÓ\n 1 0 obj\n << "

No need to make any conversions. I guess you should check out the encoder.
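
To see the encoder side from Clojure (a sketch using only java.util.Base64): encoding the correctly decoded text as ISO-8859-1 bytes should reproduce the OP's original Base64 string, while encoding its UTF-8 bytes gives a different, longer string:

(let [s   "%PDF-1.1\r\n%âãÏÓ\r\n1 0 obj\r\n<< \r"
      enc (java.util.Base64/getEncoder)]
  ;; ISO-8859-1 bytes round-trip back to the original Base64 string;
  ;; UTF-8 bytes do not, because â ã Ï Ó become two bytes each.
  [(.encodeToString enc (.getBytes s "ISO-8859-1"))
   (.encodeToString enc (.getBytes s "UTF-8"))])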

Original answer

This problem is due to Java's byte type being signed. So much fun!

When you convert it to a string, all the negative values get turned into 65533, which is plain wrong:

(map long (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))

;; (37 80 68 70 45 49 46 49 13 10 37 65533 65533 65533 65533 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13)

Let's see what happens:

(defn decode  [to-decode]
  (let [byts           (.getBytes to-decode "UTF-8")
        decoded        (.decode (java.util.Base64/getDecoder) byts)]
    decoded))

(into [] (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))

;; [37 80 68 70 45 49 46 49 13 10 37 -30 -29 -49 -45 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13]

See the negatives? Let's try to fix that:

(into [] (char-array (map #(if (neg? %) (char (+ 256 %)) (char %)) (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))))

;; [\% \P \D \F \- \1 \. \1 \return \newline \% \â \ã \Ï \Ó \return \newline \1 \space \0 \space \o \b \j \return \newline \< \< \space \return]

And if we turn this into a string, we get what emacs gave us:

(String. (char-array (map #(if (neg? %) (char (+ 256 %)) (char %)) (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))))
;; "%PDF-1.1\r\n%âãÏÓ\r\n1 0 obj\r\n<< \r"
Shlomi
  • This explanation is wrong. Whether the byte is signed or not matters for integer arithmetic, not when you treat the byte as simply a byte (eight bits) as is done when we talk about encoded data. The ‘negative’ bytes are those that have a leading 1 bit, which (in the context in which they appear here) are simply invalid UTF-8 code units. – glts Mar 14 '18 at 21:26
  • You are generally correct, however it seems that java's String constructor simply handles the negative values wrong, i.e. it turns them all to the same number, instead of treating them as the corresponding positive number. – Shlomi Mar 14 '18 at 21:28
  • @glts after further searching, I believe you are correct and the encoder might be at fault somehow. thanks! – Shlomi Mar 14 '18 at 21:44
  • Hmm... I don't understand how the encoder could be at fault, when I can paste all of the encoded strings into Emacs and decode with Emacs to get a valid PDF file. However, with any Java option so far it doesn't work, even the negative flipping/etc. – joefromct Mar 14 '18 at 21:49
  • Sorry, but unfortunately I'm not the encoder. I'll keep at it; I'm going to try it in Python also and see if I get different results. – joefromct Mar 14 '18 at 21:58