
I need to read user-generated data into a string and perform the usual string operations on it.

The user-generated data will be base64 encoded, and I need to decode that into a string.

I know nothing about this data except that it will be of the MIME type text/plain.

I want a simple, no-brainer way of decoding this data into a string. Something that just works out of the box and I don't have to think about any edge cases.

Any ideas?

Ali
  • What do you want to do with the string once you've put it into a string? – Klitos Kyriacou Nov 09 '16 at 11:45
  • @KlitosKyriacou Regular string stuff: substring, length, match, etc. – Ali Nov 09 '16 at 12:05
  • In general it's not possible to handle all possible encodings unless you know what the encoding is. E.g. if you don't know whether a string is encoded as UTF-8 or ISO-Latin-1, then you can't reliably get its length. – Klitos Kyriacou Nov 09 '16 at 13:29
  • @KlitosKyriacou Is there anything in guava, apache commons, etc which will take care of all that plumbing for me and just give me a string? – Ali Nov 09 '16 at 13:36
  • Even if you had the world's most thorough library that had every method you could think of, you still wouldn't be able to turn bytes to Unicode characters without knowing the encoding. If you can get the encoding from somewhere, then use that; otherwise, the best you can do is guess. That's what many applications do; that's why you often see funny characters when reading websites and other documents - errors caused by using the wrong encoding. – Klitos Kyriacou Nov 09 '16 at 13:38
  • @KlitosKyriacou Is there no encoding which is the superset of them all and will work for everything? – Ali Nov 09 '16 at 17:59
  • No, unfortunately if you have a message and you don't know how it's encoded, you can't reliably decode it into a Unicode string. The best you can do is look at the stream of bytes and make a best guess as to the most probable encoding based on the data. This is what some third-party libraries do. E.g. if under one encoding you find two letters that often appear together (e.g. a vowel and a consonant) while under another encoding you get characters that are unlikely to appear together, then it probably uses the first encoding. – Klitos Kyriacou Nov 10 '16 at 10:00
  • Having said the above, you might also consider the fact that the first 128 values from Unicode (i.e. 0 to 127) are the same as ASCII. Therefore, if all your bytes have values in the range 0 to 127, then you can be sure they use the ASCII encoding, and your string will then be correctly decoded whatever encoding you use (e.g. the default one). However, if any of the byte values are in the range 128-255, then you can't be certain how to decode it. – Klitos Kyriacou Nov 10 '16 at 10:05
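
The ASCII observation in the last comment can be checked mechanically. The following is only a minimal sketch: the class and method names (`AsciiCheck`, `isPureAscii`) are made up for illustration, and the conclusion only holds for ASCII-compatible encodings such as UTF-8 or ISO-8859-1, not for UTF-16, where plain text also contains zero bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of any library.
final class AsciiCheck {
    /** Returns true when every byte is in the 7-bit range 0..127. */
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            // Java bytes are signed, so values 128..255 show up as negatives.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] decoded = Base64.getDecoder().decode("SGVsbG8=");
        if (isPureAscii(decoded)) {
            // Safe for any ASCII-compatible charset; US_ASCII makes that explicit.
            System.out.println(new String(decoded, StandardCharsets.US_ASCII));
        }
    }
}
```

If the check fails, the bytes contain values above 127 and the encoding still has to come from somewhere else.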

1 Answer


If you know the charset of the original text, you can use this method:

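    // Requires: import java.nio.charset.Charset; import java.util.Base64;
    public static String fromBase64ToString(String base64String, Charset c) {
        // Base64 text is plain ASCII, so decode the String directly;
        // the charset only applies to the decoded bytes.
        byte[] b = Base64.getDecoder().decode(base64String);
        return new String(b, c);
    }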

Unfortunately, there is no Java API that can tell you which charset a sequence of bytes was encoded with. Third-party libraries such as juniversalchardet can guess the encoding from the raw bytes (for example, from the input stream you are reading), but the result is a guess, not a guarantee.
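
When the charset really is unknown, one common heuristic that needs no extra library is to try a strict UTF-8 decode and fall back to a caller-supplied guess if the bytes are not valid UTF-8. This is a sketch under that assumption, not a reliable detector; `Base64Text`, `decodeStrict` and `decodeGuessing` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of any library.
final class Base64Text {
    /** Decodes strictly: throws instead of silently substituting U+FFFD. */
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Tries UTF-8 first, then falls back to the caller's best guess. */
    static String decodeGuessing(String base64, Charset fallback) {
        byte[] bytes = Base64.getDecoder().decode(base64);
        try {
            return decodeStrict(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: the fallback may still be the wrong charset.
            return new String(bytes, fallback);
        }
    }
}
```

For example, `Base64Text.decodeGuessing(input, StandardCharsets.ISO_8859_1)` decodes valid UTF-8 as UTF-8 and treats everything else as Latin-1. Text in other single-byte encodings will still decode without an error but may show the wrong characters.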

user6904265
  • That was easier than expected :) – Ali Nov 09 '16 at 12:08
  • @ClickUpvote easier than expected possibly because you're not aware of the implications. `new String(b)` uses the current system default encoding. How do you know that's the same encoding used by the MIME message? – Klitos Kyriacou Nov 09 '16 at 13:10
  • @KlitosKyriacou Yeah, I felt like something might be missing. Can you post an answer which will work across everything? – Ali Nov 09 '16 at 13:36
  • It's up to you to know what encoding the bytes are in. Are they coming in an HTTP request? Something else? – bmargulies Nov 09 '16 at 14:04
  • @bmargulies Something else – Ali Nov 09 '16 at 17:58