8

I need to convert the content of an InputStream into a String. The difficulty here is the input encoding, namely Latin-1. I tried several approaches and code snippets with String, getBytes, char[], etc. in order to get the encoding straight, but nothing seemed to work.

Finally, I came up with the working solution below. However, this code seems a little verbose to me, even for Java. So the question here is:

Is there a simpler and more elegant approach to achieve what is done here?

private String convertStreamToStringLatin1(java.io.InputStream is)
        throws IOException {

    String text = "";

    // setup readers with Latin-1 (ISO 8859-1) encoding
    BufferedReader i = new BufferedReader(new InputStreamReader(is, "8859_1"));

    int numBytes;
    CharBuffer buf = CharBuffer.allocate(512);
    while ((numBytes = i.read(buf)) != -1) {
        text += String.copyValueOf(buf.array(), 0, numBytes);
        buf.clear();
    }

    return text;
}
cyroxx

5 Answers

7

Firstly, a few criticisms of the approach you've taken already. You shouldn't unnecessarily use an NIO CharBuffer when you merely want a char[512]. You don't need to clear the buffer each iteration, either.

int numBytes;
final char[] buf = new char[512];
while ((numBytes = i.read(buf)) != -1) {
    text += String.copyValueOf(buf, 0, numBytes);
}

You should also know that just constructing a String with those arguments will have the same effect, as the constructor too copies the data.

The contents of the subarray are copied; subsequent modification of the character array does not affect the newly created string.
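
Putting the two criticisms above together, here is a minimal sketch of the reader-based loop. It assumes Java 7 or later for `StandardCharsets`, and it swaps the repeated `text +=` for a `StringBuilder`, which is a substitution on my part rather than something the original code did. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Latin1ReaderSketch {
    // Hypothetical helper: decodes the whole stream as ISO-8859-1 (Latin-1).
    static String readLatin1(InputStream is) throws IOException {
        Reader reader = new InputStreamReader(is, StandardCharsets.ISO_8859_1);
        StringBuilder text = new StringBuilder();
        char[] buf = new char[512];
        int numChars; // read() returns a count of chars, not bytes
        while ((numChars = reader.read(buf)) != -1) {
            // append copies the chars, so the buffer can be reused as-is
            text.append(buf, 0, numChars);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 }; // "café" encoded as Latin-1
        System.out.println(readLatin1(new ByteArrayInputStream(latin1)));
    }
}
```

The plain `char[]` needs no `clear()` call because each `read` starts writing at index 0 again.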


You can use a dynamic ByteArrayOutputStream which grows an internal buffer to accommodate all the data. You can then use the entire byte[] from toByteArray to decode into a String.

The advantage is that deferring decoding until the end avoids decoding fragments individually; while that may work for simple charsets like ASCII or ISO-8859-1, it will not work on multi-byte schemes like UTF-8 and UTF-16. This means it is easier to change the character encoding in the future, since the code requires no modification.

private static final String DEFAULT_ENCODING = "ISO-8859-1";

public static final String convert(final InputStream in) throws IOException {
  return convert(in, DEFAULT_ENCODING);
}

public static final String convert(final InputStream in, final String encoding) throws IOException {
  final ByteArrayOutputStream out = new ByteArrayOutputStream();
  final byte[] buf = new byte[2048];
  int rd;
  // collect the raw bytes first; read returns -1 at end of stream
  while ((rd = in.read(buf, 0, 2048)) >= 0) {
    out.write(buf, 0, rd);
  }
  // decode once, after all bytes have been read
  return new String(out.toByteArray(), encoding);
}
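
To illustrate the point about fragments, here is a small sketch that decodes the same UTF-8 bytes two ways: chunk by chunk, and once at the end. The class and method names are hypothetical, and the chunk size is chosen only to force a split in the middle of a multi-byte character. Note that this concerns decoding raw byte chunks yourself; an `InputStreamReader` keeps decoder state between reads.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ChunkedDecodingSketch {
    // Decodes each byte fragment on its own, as a naive per-chunk loop would.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            byte[] fragment = Arrays.copyOfRange(data, off, end);
            sb.append(new String(fragment, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Decodes once, after all bytes are available.
    static String decodeAtEnd(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // two bytes for one character
        System.out.println(decodeAtEnd(utf8).equals("\u00e9"));       // whole-array decode round-trips
        System.out.println(decodePerChunk(utf8, 1).equals("\u00e9")); // split mid-character does not
    }
}
```

With a single-byte charset such as ISO-8859-1 both methods agree, which is why the per-chunk approach appears to work until the encoding changes.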
obataku
  • Thank you for your critical comment. Your first solution was like what I was looking for. However, I can see your point with your second solution which very much addresses the general case. I guess this is also why the buffer size is 2048 bytes in your example? – cyroxx Aug 12 '12 at 23:42
  • The 2048-byte buffer was just personal preference; you could use whatever provides a reasonable trade-off for run-time and memory consumption. – obataku Aug 13 '12 at 09:54
3

I don't see how it could be much simpler. I did this a little differently once: if you already have a String, you can do this:

new String(originalString.getBytes(), "ISO-8859-1");

So something like this could also work:

BufferedReader reader = new BufferedReader(new InputStreamReader(is));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
  sb.append(line + "\n");
}
is.close();
return new String(sb.toString().getBytes(), "ISO-8859-1");

EDIT: I should add, this is really just an alternative to your already working solution. When it comes to converting Streams in Java it won't be much simpler, so go for it. :)

Blacklight
  • There are several things to improve here. Firstly, this will not reproduce the exact text when the input does not end with a line terminator: it appends a trailing `\n` that was not there originally. In addition, an `InputStreamReader` constructed without a charset uses the default system encoding. It is a better idea to construct the [`InputStreamReader`](http://goo.gl/mhzP1) with `StandardCharsets.ISO_8859_1`, so that `StringBuilder.toString` gives you the correctly decoded string in one step. – obataku Aug 07 '12 at 22:44
  • 1
    About the \n: I'll take that improvement, thanks. I wasn't really paying attention to the InputStream->String conversion; it was just to complete the example. The different way of handling the encoding is still OK imho, there are many ways to Rome. ;-) But like I said, it's just an alternative. Utilities like Commons IO clean up the code, but they do essentially the same thing and depend on an additional library. That makes sense if you use them more often; it's a matter of personal choice. – Blacklight Aug 08 '12 at 06:42
1

I just found out that this answer to the question Read/convert an InputStream to a String can be applied to my problem, please see the code below. Anyway, I very much appreciate the answers you've given so far.

private String convertStreamToString(InputStream is, String charsetName) {
    try {
        return new java.util.Scanner(is, charsetName).useDelimiter("\\A").next();
    } catch (java.util.NoSuchElementException e) {
        return "";
    }
}

So in order to decode Latin-1 input, call it like this:

String message = convertStreamToString(is, "8859_1");
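
As a quick, self-contained check of how the helper above behaves, the sketch below wraps it in a hypothetical class and feeds it in-memory streams. It uses the canonical charset name "ISO-8859-1" rather than the "8859_1" alias, and the sample bytes are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.NoSuchElementException;
import java.util.Scanner;

class ScannerSketch {
    // Same technique as above: "\\A" matches only at the start of input,
    // so the first token is the entire stream.
    static String convertStreamToString(InputStream is, String charsetName) {
        try {
            return new Scanner(is, charsetName).useDelimiter("\\A").next();
        } catch (NoSuchElementException e) {
            // thrown when the stream is empty, since there is no token at all
            return "";
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 }; // "café" encoded as Latin-1
        String text = convertStreamToString(new ByteArrayInputStream(latin1), "ISO-8859-1");
        System.out.println(text.equals("caf\u00e9"));
        System.out.println(convertStreamToString(new ByteArrayInputStream(new byte[0]), "ISO-8859-1").isEmpty());
    }
}
```

The empty-stream case is the reason for the `catch` block: without it, `next()` would propagate the `NoSuchElementException`.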
cyroxx
  • You should know that `Scanner` internally compiles a regex `Pattern` for the delimiter. This method is indeed interesting and nifty, but also probably not advisable. – obataku Aug 07 '12 at 22:53
  • I'd like to gain some more insight on this: What is the problem with that pattern? Shouldn't it be rather lightweight? – cyroxx Aug 10 '12 at 17:25
  • It just seems like an interesting solution but an abuse of `Scanner`. In the answer you linked to, they put it well... a *stupid `Scanner` trick*. – obataku Aug 10 '12 at 19:27
0

If you don't want to plumb it yourself, you could have a look at the Apache Commons IO project: IOUtils.toString(InputStream input, String encoding) seems to do what you want. I haven't tried that method myself, but the Javadoc states "Get the contents of an InputStream as a String using the specified character encoding."

Fredrik LS
0

Guava's IO package is really nice this way.

Files.toString(yourFile, Charsets.ISO_8859_1)

or from a stream

new String(ByteStreams.toByteArray(stream), Charsets.ISO_8859_1)
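
For comparison, the same read-everything-then-decode shape can be sketched without Guava, assuming Java 9 or later, where `InputStream.readAllBytes()` is available in the JDK. The class and method names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ReadAllBytesSketch {
    // Assumes Java 9+: readAllBytes() reads until end of stream.
    static String readLatin1(InputStream stream) throws IOException {
        return new String(stream.readAllBytes(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 }; // "café" encoded as Latin-1
        System.out.println(readLatin1(new ByteArrayInputStream(latin1)).equals("caf\u00e9"));
    }
}
```

Like the Guava call, this holds the whole stream in memory, so it suits inputs of modest size.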
Mike Samuel