3

I have data in binary format (hex: 80 3b c8 87 0a 89) and I need to convert it into a String in order to save the binary data in an MS Access db via Jackcess. I know that I'm not supposed to use String in Java for binary data, but the Access db is a third-party product and I have no control over it whatsoever.

So I tried to convert binary data and save it, but unfortunately the result was unexpected.

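import java.nio.charset.StandardCharsets;

// Values above 0x7f need a (byte) cast, because Java bytes are signed.
byte[] byteArray = new byte[] {(byte) 0x80, 0x3b, (byte) 0xc8, (byte) 0x87, 0x0a, (byte) 0x89};
System.out.println(String.format("%02X ", byteArray[0]) + String.format("%02X ", byteArray[1])); // gives me the same values

String value = new String(byteArray, StandardCharsets.UTF_8); // or any other encoding
System.out.println(value); // completely different values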

I would like to know what is going on under new String, and whether there is a way to convert binary data into a String and keep the same hex values.

Note 1: initially I read a binary file which has nothing to do with hex. I use hex just for comparison of datasets.

Note 2: There was a suggestion to use Base64 (aka MIME), UTF-7, etc. By my understanding, it takes binary data and encodes it into the ANSI charset, basically tweaking the initial data. However, for me that is not a solution, because I must write the exact data that I hold in the binary array.

import org.apache.commons.codec.binary.Base64; // Apache Commons Codec

byte[] byteArray = new byte[] {0x2f, 0x7a, 0x2d, 0x28};
byte[] bytesEncoded = Base64.encodeBase64(byteArray);
System.out.println("encoded value is " + new String(bytesEncoded)); // new data
Dzidas
  • Hint: what does `String.valueOf(byteArray)` return? (It's not anything useful) – user253751 Jan 30 '15 at 10:27
  • Did you try `String value = new String(byteArray, "UTF-8");` – mr.icetea Jan 30 '15 at 10:28
  • @mr.icetea: That's really not going to work. UTF-8 isn't hex... – Jon Skeet Jan 30 '15 at 10:33
  • @JonSkeet ok i got it, so i found this http://stackoverflow.com/questions/140131/convert-a-string-representation-of-a-hex-dump-to-a-byte-array-using-java could be a solution for OP – mr.icetea Jan 30 '15 at 10:35
  • @mr.icetea I tried ANSI, UTF-8,16 and the values (in hex) are not the same. – Dzidas Jan 30 '15 at 10:37
  • @mr.icetea: Had misread the question, in fact. But basically, trying to convert to a string this way is a really bad idea... the question you refer to would work, although I'd personally just use Guava. – Jon Skeet Jan 30 '15 at 10:39

2 Answers

4

In order to safely convert arbitrary binary data into text, you should use something like hex or base64. Encodings such as UTF-8 are meant to encode arbitrary text data as bytes, not to encode arbitrary binary data as text. It's a difference in terms of what the source data is.
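To see the mismatch concretely, here is a minimal sketch that uses only the JDK and the bytes from the question. The class and method names (`Utf8RoundTrip`, `roundTrip`) are made up for illustration. It decodes the bytes as UTF-8 text and then encodes that text back to bytes, which is the round trip a text field would have to survive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    // Decodes the bytes as UTF-8 text, then encodes that text back to bytes.
    static byte[] roundTrip(byte[] input) {
        String text = new String(input, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The bytes from the question; values above 0x7f need a (byte) cast.
        byte[] original = {(byte) 0x80, 0x3b, (byte) 0xc8, (byte) 0x87, 0x0a, (byte) 0x89};
        byte[] back = roundTrip(original);

        System.out.println(Arrays.equals(original, back)); // prints "false"
        System.out.println(original.length + " -> " + back.length); // prints "6 -> 10"
    }
}
```

The lone bytes 0x80 and 0x89 are not valid UTF-8 on their own, so the decoder substitutes the replacement character U+FFFD for each of them and the original values are lost. A binary-to-text scheme such as hex or base64 does not have that problem.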

I would strongly recommend using a library for this. For example, with Guava:

String hex = BaseEncoding.base16().encode(byteArray);
// Store hex in the database in the text field...
...
// Get hex from the database from the text field...
byte[] binary = BaseEncoding.base16().decode(hex);

(Other libraries are available, of course, such as Apache Commons Codec.)
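If adding a library is not an option, the same round trip can be sketched with the JDK alone. This is an illustrative sketch, not a replacement for a tested library: the class `BinaryTextRoundTrip` and its helpers `toHex` and `fromHex` are hypothetical names, while `java.util.Base64` is part of the JDK since Java 8.

```java
import java.util.Arrays;
import java.util.Base64;

class BinaryTextRoundTrip {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Encodes each byte as two upper-case hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }

    // Decodes a hex string produced by toHex back into the original bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] original = {(byte) 0x80, 0x3b, (byte) 0xc8, (byte) 0x87, 0x0a, (byte) 0x89};

        String hex = toHex(original);
        System.out.println(hex); // prints "803BC8870A89"
        System.out.println(Arrays.equals(original, fromHex(hex))); // prints "true"

        String base64 = Base64.getEncoder().encodeToString(original);
        byte[] decoded = Base64.getDecoder().decode(base64);
        System.out.println(Arrays.equals(original, decoded)); // prints "true"
    }
}
```

The stored text looks nothing like the raw bytes, but decoding it returns exactly the byte array that was encoded, which is the property that matters here.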

Alternatively, save your binary data into a field in Access which is designed for binary data, instead of converting it to text at all.

Jon Skeet
  • No, I do not convert data into hex. I use hex to confirm, that I have the same data across datasets. – Dzidas Jan 30 '15 at 10:35
  • @Dzidas: Well you *should* convert the data into hex - or base64, or something like that. Editing... – Jon Skeet Jan 30 '15 at 10:37
  • I know that I'm supposed to use MIME for such data, but I'm trying to fit into a third-party design and I have no power to change it. – Dzidas Jan 30 '15 at 10:45
  • @Dzidas: I don't see what MIME has to do with this at all. Is that some extra requirement from somewhere that you'd just failed to mention? The code I've provided will reversibly convert any binary data into text. – Jon Skeet Jan 30 '15 at 10:46
  • I'm trying to generate an index field for a 3rd party db. The index field is text and has binary data, something like `䠎ℎ餎餎討ℎāā`. My software reads values from the file in Java as binary data, converts them into a String and saves it into Access. However, during the conversion to String the data is changed and becomes incorrect. I suppose that base64 would change the data as well. – Dzidas Jan 30 '15 at 10:55
  • @Dzidas: "Index field is text and has binary data" makes no sense. Either it's text, or it's binary. It can't be both. It could be binary data encoded as text, in which case it looks like it's broken to start with. Note that your question doesn't explain *any* of this... your question supposes that you're *starting* with a byte array. – Jon Skeet Jan 30 '15 at 10:58
  • I did a test with base64 and as I said in MIME comment, it doesn't work as I need - Note2 – Dzidas Jan 30 '15 at 12:12
  • @Dzidas: Base64 and MIME are *entirely* different things... and it's not clear what you mean by it not working, as you can retrieve the exact byte array back from the string later. You appear to be slightly confused by your own requirements here. – Jon Skeet Jan 30 '15 at 12:20
  • According to Wikipedia, `Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation` and MIME is one of the implementations. Nevertheless, I did a test like the example above and the initial data was changed after the transformation (as was to be expected). – Dzidas Jan 30 '15 at 12:31
  • @Dzidas: MIME is a much more general concept than that. One of the base64 variants may be called MIME because that's what's used *within* MIME, but that's far from all of what MIME is. But again, with either base64 or hex, you start with a byte array, convert it to a string, and can then convert the string *back* to a byte array to get the same data. There *won't* be any change there. It's not clear what you're expecting to see. – Jon Skeet Jan 30 '15 at 12:32
  • @Dzidas re: "Index field is text and has binary data, something like `䠎ℎ餎餎討ℎāā`" - How are you determining what the index field currently contains? If characters such as `䠎` and `討` appear when you open the database in Access itself then they are almost certainly Unicode characters. If that's the case, then your scheme to convert single bytes to single characters won't work because Unicode characters (however they are encoded) can result in more than one byte per character. – Gord Thompson Jan 30 '15 at 12:39
1

The basic lesson to take away: never mix up binary data with its String equivalent.

My mistake was that I exported the initial data from Access into csv while changing the type of the index field from binary to String (a total mess, now I know). The solution I came up with is my own export tool for Access, in which all data is kept as binary. Thanks to @gord-thompson, whose comment led to the solution.

Dzidas