
I want to read an input string and return it as a UTF-8 encoded string, so I found an example on the Oracle/Sun website that used FileInputStream. I wanted to read a string rather than a file, so I changed it to StringBufferInputStream and used the code below. The method parameter jtext is some Japanese text. This method actually works great; my question is about the deprecated code. I had to add @SuppressWarnings because StringBufferInputStream is deprecated. Is there a better way to get an input stream from a string? Is it OK just to leave it as is? I've spent so long trying to fix this problem that I don't want to change anything now that I seem to have cracked it.

    @SuppressWarnings("deprecation")
    private String readInput(String jtext) {
        StringBuffer buffer = new StringBuffer();
        try {
            StringBufferInputStream sbis = new StringBufferInputStream(jtext);
            InputStreamReader isr = new InputStreamReader(sbis, "UTF8");
            Reader in = new BufferedReader(isr);
            int ch;
            while ((ch = in.read()) > -1) {
                buffer.append((char) ch);
            }

            in.close();
            return buffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }

I think I found a solution - of sorts:

    private String readInput(String jtext) {
        String n;
        try {
            n = new String(jtext.getBytes("8859_1"));
            return n;
        } catch (UnsupportedEncodingException e) {
            return null;
        }
    }

Before, I was desperately using getBytes("UTF8"). But by chance I tried Latin-1 ("8859_1") and it worked. Why it worked, I can't fathom. This is what I did, step by step:

OpenOffice CSV(utf8)------>SQLite(utf8, apparently)------->java encoded as Latin-1, somehow readable.

3 Answers


The reason that StringBufferInputStream is deprecated is that it is fundamentally broken ... for anything other than Strings consisting entirely of Latin-1 characters. According to the javadoc, it "encodes" characters by simply chopping off the top 8 bits! You don't want to use it if your application needs to handle Unicode correctly.
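To make the "chopping" concrete, here is a small sketch (the class and method names are made up for illustration) that reads back what the deprecated stream produces for one ASCII, one Latin-1 and one Japanese character:

```java
import java.io.StringBufferInputStream;

class LowByteDemo {
    // Returns the byte values the deprecated stream yields for the given text.
    @SuppressWarnings("deprecation")
    static int[] streamBytes(String text) {
        StringBufferInputStream in = new StringBufferInputStream(text);
        int[] out = new int[text.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = in.read(); // one byte per char: only the low 8 bits survive
        }
        return out;
    }

    public static void main(String[] args) {
        // U+0041 ('A'), U+00E9 (e with acute accent), U+65E5 (a kanji)
        for (int b : streamBytes("A\u00e9\u65e5")) {
            System.out.printf("0x%02X%n", b); // prints 0x41, 0xE9, 0xE5
        }
    }
}
```

The kanji U+65E5 comes out as the single byte 0xE5, so the high byte is gone and cannot be recovered from the stream.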

If you want to create an InputStream from a String, then the correct way to do it is to use String.getBytes(...) to turn the String into a byte array, and then wrap that in a ByteArrayInputStream. (Make sure that you choose an appropriate encoding!).
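A minimal sketch of that approach, assuming UTF-8 is the encoding you want (the class and helper names are hypothetical). The second helper only exists to show that decoding with the same charset gives the original text back:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StringStreamDemo {
    // Encodes the text as UTF-8 and exposes the bytes as an InputStream.
    static InputStream toUtf8Stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes a UTF-8 stream back into a String, to show the round trip.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three kanji, written as escapes
        System.out.println(text.equals(readUtf8(toUtf8Stream(text)))); // prints true
    }
}
```

Using StandardCharsets constants instead of charset names also avoids the checked UnsupportedEncodingException.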

But your sample application immediately takes the InputStream, converts it to a Reader and then adds a BufferedReader. If this is your real aim, then a simpler and more efficient approach is simply this:

Reader in = new StringReader(text);

This avoids the unnecessary encoding and decoding of the String, and also the "buffer" layer which serves no useful purpose in this case.

(A buffered stream is much more efficient than an unbuffered stream if you are doing small I/O operations on a file, network or console stream. But for a stream that is served from an in-memory data structure the benefits are much smaller, and possibly even negative.)
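As a sketch, here is the read loop from the question rewritten on top of StringReader (the class and method names are made up). Note that no bytes are involved at any point, so the chars come out exactly as they went in; this cannot repair a String that is already garbled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StringReaderDemo {
    // Copies the text through a StringReader one char at a time.
    static String copyViaReader(String text) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new StringReader(text)) {
            int ch;
            while ((ch = in.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three kanji, written as escapes
        System.out.println(text.equals(copyViaReader(text))); // prints true
    }
}
```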

FOLLOWUP

I realized what you are trying to do now ... work around a character encoding / decoding issue.

My advice would be to try to figure out definitively the actual encoding of the character data that is being delivered by the database, then make sure that the JDBC drivers are configured to use the same encoding. Trying to undo the mis-translation by encoding with one encoding and decoding with another is dodgy, and can give you only a partial correction of the problems.

You also need to consider the possibility that the characters got mangled on the way into the database. If this is the case, then you may be unable to de-mangle them.
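Here is a sketch of why re-encoding can appear to work in one case and fail in another. It assumes, and this is only a guess about what happened in the question, that UTF-8 bytes were decoded as Latin-1 somewhere along the way. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates a reader that wrongly decodes UTF-8 bytes as Latin-1.
    static String misdecodeAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes that specific mistake: recover the raw bytes, decode as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // A lossy mis-decoding: every non-ASCII byte is replaced by U+FFFD.
    static String misdecodeAsAscii(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three kanji, written as escapes
        // Latin-1 maps every byte to a char, so the damage is reversible.
        System.out.println(text.equals(repair(misdecodeAsLatin1(text)))); // prints true
        // Here the original bytes are gone, so no re-encoding brings them back.
        System.out.println(text.equals(repair(misdecodeAsAscii(text)))); // prints false
    }
}
```

The repair only works because Latin-1 happens to preserve every byte value. With most other wrong decodings some bytes are replaced, which is the "partial correction" risk described above.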

Stephen C
  • I wasn't sure whether I needed a buffer, but it was a cut-and-paste job from the Oracle example so I left it in. I'll try String.getBytes(...) next time I get a chance (no more Android today). Thanks for your help! –  Dec 26 '10 at 08:57
  • Unfortunately StringReader didn't work. The garbled strings didn't change. A class that changes the string into bytes would be good. So I think I'll stick with StringBufferInputStream until I find something better. I did find out that the DB encodes in UTF8. –  Dec 27 '10 at 04:49
  • If you know that the DB is encoding the characters as UTF-8, you now need to make sure that the JDBC driver (or whatever) is configured to expect UTF-8. Then you won't need to stuff around trying to "fix" the Strings by recoding them. – Stephen C Dec 27 '10 at 05:02
  • Well, that's where things get difficult. (I'm using Android with SQLite.) The problem is SQLite says it is storing my Japanese text in UTF8, but it is garbled in the database. English text is not garbled, even if the same CSV contains English and Japanese text. But I'm tempted to leave it where it is, because I've double, triple checked the Japanese text after SBIS and it is all fine. –  Dec 27 '10 at 05:11

Is this what you are trying to do? Here is a previous answer to a similar question. I am not sure why you want to convert a String to exactly the same String.

A Java String holds a sequence of chars, each of which represents a Unicode value (a UTF-16 code unit). So it is possible to construct the same string from two different byte sequences, say, one encoded with UTF-8 and the other with UTF-16.
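For instance, a short sketch using UTF-8 and UTF-16 as the two encodings (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SameStringDemo {
    // True if the two encodings give different bytes that decode to equal Strings.
    static boolean differentBytesSameString(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String fromUtf16 = new String(utf16, StandardCharsets.UTF_16BE);
        return !Arrays.equals(utf8, utf16) && fromUtf8.equals(fromUtf16);
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three kanji, written as escapes
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // prints 9
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // prints 6
        System.out.println(differentBytesSameString(text));                  // prints true
    }
}
```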

If you want to write it to a file, you can always convert it with String.getBytes("encoding");

private static String readInput(String jtext) {
    byte[] bytes = jtext.getBytes();
    try {
        String string = new String(bytes, "UTF-8");
        return string;
    } catch (UnsupportedEncodingException ex) {
        // do something
        return null;
    }
}

Update

Here is my assumption.

According to your comment, your SQLite DB stores text values using one encoding, say UTF-16. For some reason, your SQLite API cannot determine which encoding was used to encode the Unicode values into a sequence of bytes.

So when you use the getString method from your SQLite API, it reads a set of bytes from your DB and converts them into a Java String using an incorrect encoding. If this is the case, you should use the getBytes method and reconstruct the String yourself, i.e. new String(bytes, "encoding used in your DB"); If your DB is stored in UTF-16, then new String(bytes, "UTF-16"); should be readable.

Update

I wasn't talking about the getBytes method on the String class. I meant the getBytes method on your SQL result object, e.g. result.getBytes(String columnLabel).

ResultSet result = .... // from SQL query
String readableString = readInput(result.getBytes("my_table_column"));

You will need to change the signature of your readInput method to

private static String readInput(byte[] bytes) {
    try {
        // Change the encoding to your DB encoding.
        // This can be UTF-8, UTF-16, 8859_1, etc.
        String string = new String(bytes, "UTF-8");
        return string;
    } catch (UnsupportedEncodingException ex) {
        // The named encoding is not supported: at least return
        // (possibly garbled) text decoded with the platform default charset.
        return new String(bytes);
    }
}

Whichever encoding you set here that makes your String readable is the encoding of your column in the DB. This involves no unexplainable phenomenon, and you know exactly what your column encoding is.
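One way to run that experiment is to decode the same bytes with a few candidate charsets and compare the results by eye. This is only a sketch: the class and method names are made up, and the byte array in main stands in for whatever your DB API returns for the column.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class CharsetProbe {
    // Decodes the same bytes with each candidate charset, keyed by charset name.
    static Map<String, String> decodeWithCandidates(byte[] bytes) {
        Charset[] candidates = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16,
            StandardCharsets.ISO_8859_1
        };
        Map<String, String> results = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            results.put(cs.name(), new String(bytes, cs));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for the raw column bytes; here they are known to be UTF-8.
        byte[] columnBytes = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<String, String> e : decodeWithCandidates(columnBytes).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Only the entry for the charset the bytes were actually written in should come out readable.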

But it would be better to configure your JDBC driver to use the correct encoding, so that you do not need this readInput method to convert.

If no encoding makes your string readable, you will need to consider the possibility that the characters got mangled when they were written to the DB, as @Stephen C said. If this is the case, using a workaround may cause you to lose some of the characters during conversion. You will also need to solve the encoding problem on the write side.

gigadot
  • The string is Japanese text, and garbled and unreadable. It was taken from a SQLite db, which is where my problem was. For some reason SQLite didn't want to read Japanese text, even though I formatted the input in UTF8. I came to the conclusion that the db wasn't storing the text as UTF8, but as something else, possibly UTF16. So when I read the string from the DB (I should mention I'm using Android Java) it was garbled. So I decided to reformat it in java/android in UTF8, and hey presto, it worked! Maybe there was some wishful thinking and dark magic involved but it worked! –  Dec 26 '10 at 05:03
  • Character encoding is a way to map a sequence of bytes to Unicode values. If your DB stores a sequence of bytes using UTF16, then when you read it into a String, you will need to specify UTF-16 as the encoding. If I understand correctly, your SQLite API returns a String but doesn't set the encoding correctly, which is why it works when you re-encode. Can you try getting the bytes from your DB and then use new String(bytes, "encoding used in your db"); to construct your String, i.e. don't use your SQLite getString method. – gigadot Dec 26 '10 at 05:23
  • I don't know the encoding used in the DB. It can't be UTF8 because it's garbled. But SQLite only knows UTF8 and UTF16, so I guess it must be UTF16. Problem is OpenOffice can only format CSV files (for DB import) as UTF7/8. –  Dec 26 '10 at 08:55
  • I tried "new String(bytes, "UTF-16");" but the text is garbled. In fact getBytes() doesn't work with either UTF-8 or UTF-16. I'm thinking I should stick with StringBufferInputStream, deprecated or not. –  Dec 27 '10 at 04:19
  • @JJG - `StringBufferInputStream` won't "work" unless the characters are Latin-1 – Stephen C Dec 27 '10 at 05:04
  • If you are interested, please see my edited OP for my "solution". –  Dec 27 '10 at 05:39

The StringReader class is the new alternative to the deprecated StringBufferInputStream class.

However, you state that what you actually want to do is take an existing String and return it encoded as UTF-8. You should be able to do that much more simply I expect. Something like:

s8 = new String(jtext.getBytes("UTF8"));
gavinb
  • Thanks for your help. I tried what you said, but unfortunately the text is garbled if I use getBytes(). I don't know why. –  Dec 27 '10 at 04:11
  • @JJG the answer by gigadot has more detail on this aspect of the question. – gavinb Dec 28 '10 at 10:08