I received an XML file that contains a PDF attachment encoded as a Base64 string, and I am trying to generate a PDF file from it. The following code works well:

String base64encodedPdf =" ....   ";
byte[] imgBytes = javax.xml.bind.DatatypeConverter.parseBase64Binary(base64encodedPdf);
IOUtils.write(imgBytes, new FileOutputStream("C:\\test.pdf"));

The problem arises when the attachment data is too big to paste into the editor directly. I thought I could copy it to a text file, read that file, and convert its contents to a String. This is how I do it:

org.apache.commons.io.FileUtils.readFileToString(file, encoding)

Which encoding should I specify (UTF-8, UTF-16, ...), and why?

EDIT:

This is the meta-information available to me:

<AttachmentType tc="1">Document</AttachmentType>
<MimeType>application/pdf</MimeType>
<TransferEncodingTypeString>Base64</TransferEncodingTypeString>
<TransferEncodingTypeTC tc="4">Base64</TransferEncodingTypeTC>
Charu Khurana
  • Well what encoding has the text been stored in? We can't possibly know that - hopefully you do... – Jon Skeet Nov 08 '13 at 17:33
  • that's a good question.... I added to question what meta information I have available – Charu Khurana Nov 08 '13 at 17:36
  • Base64 is used to encode "binary" data. Thus, when you decode it, and go to write the file to disk, you want to write the exact binary result, not some character encoding. It's not character data. – Hot Licks Nov 08 '13 at 17:42
  • @HotLicks sorry didn't follow you. What change are you suggesting – Charu Khurana Nov 08 '13 at 17:47
  • The only way to convert a PDF to a valid text file is with a PDF to text converter. – Hot Licks Nov 08 '13 at 18:13
  • Of course, if you just want to look at the Base64 encoded data that's plain ASCII, and any ASCII or UTF8 encoding will be fine. But a Base64-encoded file isn't much to look at. – Hot Licks Nov 08 '13 at 18:15
  • @HotLicks my intent of copying Base64 encoded data to a file is just to read that file into code to generate PDF – Charu Khurana Nov 08 '13 at 18:19
  • See [this question](http://stackoverflow.com/questions/6302544/default-encoding-for-xml-is-utf-8-or-utf-16) for how to determine what encoding your XML document is in. It'll most likely be UTF-8 but it could be something else depending on the BOM and the XML prolog. – dcsohl Nov 08 '13 at 18:39
  • @dcsohl cool.... this answer correctly points me. My XML prolog defines encoding. Thank you very much. – Charu Khurana Nov 08 '13 at 18:44
  • For pure Base64 it doesn't matter. As I said, use ASCII or UTF8. You confuse things by talking of converting "it" to string, without being clear whether you're talking about the encoded Base64 or the decoded PDF. The former is a limited ASCII character set. The latter is "pure binary" and has no "character set" associated with the file. – Hot Licks Nov 08 '13 at 22:52

1 Answer

It depends on which encoding was used when the text file was written. Java's text-oriented I/O classes, such as PrintWriter, have constructors that let you specify the encoding explicitly, e.g.:

new PrintWriter("foo.txt", "UTF-8");

If you don't, the default encoding is used, which can vary with the platform and JVM settings. You can check your platform's default encoding using

Charset.defaultCharset()

But it is good practice to always specify your intended encoding explicitly when writing to a file.
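
Applied to the Base64 file from the question, the same idea on the reading side might look like the sketch below. It assumes the text file was saved in an ASCII-compatible encoding such as UTF-8 (if it was saved as UTF-16, that charset would have to be passed instead). It uses the JDK's own java.util.Base64 and java.nio.file.Files (Java 8+) rather than DatatypeConverter and Commons IO, and the class name Base64FileToPdf and the method decodeToFile are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

public class Base64FileToPdf {

    // Hypothetical helper: reads Base64 text from encodedFile using an
    // explicit charset and writes the decoded bytes to pdfFile.
    public static void decodeToFile(Path encodedFile, Path pdfFile) throws IOException {
        // Base64 text consists of ASCII characters only, so UTF-8 and
        // US-ASCII yield the same String for it. Assumption: the file was
        // saved in an ASCII-compatible encoding, not UTF-16.
        String base64 = new String(Files.readAllBytes(encodedFile), StandardCharsets.UTF_8);

        // The MIME decoder skips line breaks and any other characters
        // outside the Base64 alphabet.
        byte[] pdfBytes = Base64.getMimeDecoder().decode(base64);

        // The decoded PDF is binary data: write the bytes as-is, with no
        // character encoding involved.
        Files.write(pdfFile, pdfBytes);
    }

    public static void main(String[] args) throws IOException {
        // Round trip with throwaway temp files instead of a real attachment.
        byte[] original = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        Path encoded = Files.createTempFile("attachment", ".txt");
        Path pdf = Files.createTempFile("attachment", ".pdf");

        Files.write(encoded, Base64.getMimeEncoder().encode(original));
        decodeToFile(encoded, pdf);

        System.out.println(Arrays.equals(original, Files.readAllBytes(pdf)));
    }
}
```

The charset only matters for turning the text file into a String; the decoded bytes are written unchanged.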

gerrytan