UTF-8 - I don't understand this byte sequence

Question

I have a data provider that send me some data supposed to be coded with UTF-8. This data presents this sequence of bytes:

28 49 4e 54 e2 80 99 4c 29 20  (INTâ€™L) => "(INT’L)"

For some reason when my java program fetch this data and store it in the database, the above sequence becomes:

28 49 4e 54 19 4c 29 20        (INT.L) => "(INT\u0019L)"

The java program is built on top of hibernate. It first fetches the data from the provider, stores it in an entity and then this entity is persisted in the database (Postgresql).

Why am I loosing the bytes (e2 80 99 becomes 19) ?
How can I avoid this ?

Here is the core method used to transfer the data fetched from the provider to the entity:

import java.sql.Clob;

//...

public static String convertStreamToString(Clob clob) throws SQLException {
    if (clob == null) {
        return "";
    }

    BufferedReader br = null;
    StringBuilder result = new StringBuilder();

    try {
        br = new BufferedReader(new InputStreamReader(clob.getAsciiStream(), Charset.forName("UTF-8")));
        String lig;
        int n = 0;
        while ((lig = br.readLine()) != null) {
            if (n > 0) {
                result.append("\n");
            }
            result.append(lig);
            n++;
        }
    } catch (IOException ioe) {
         // Exception handling code ...
    } catch (SQLException sqlex) {
         // Exception handling code ...
    } finally {
        IOUtil.close(br);
    }

    return result.toString();
}

// ...

MyEntity entity = ...
oracle.sql.NCLOB clob = ...
entity.setProperty(convertStreamToString(clob));


@Entity
class MyEntity {
     @Column(name="prop", length=100000)
     private String prop;  

     public void setProperty(String value) {
          this.prop=value;
     }
}

I want to see how you receive the data from your provider and how you enter it into Hibernate — Brian Agnew, Aug 16 '13 at 09:52
Some component that you have strips the high byte from the characters: you would expect to get `\u2019` instead of `\u0019`. As for what causes this, we need to know more about how the data is put in the database and how you get it back out again. I'm pretty confident Hibernate is not a problem. — Joni, Aug 16 '13 at 10:14
Unless you can give us more details, you are unlikely to get anything other than general guesses. — Stephen C, Aug 16 '13 at 10:14
@Joni It has been a nightmare for months for storing Oracle NCLOB to Postgresql with Hibernate... (we are fighting this bug: https://hibernate.atlassian.net/browse/HHH-7035) — Stephan, Aug 16 '13 at 11:06

score 4 · Accepted Answer · answered Aug 16 '13 at 11:12

4

You are using the getAsciiStream() to read the contents of a CLOB. As the name says, this method is usable only for ASCII; it breaks all non-ASCII characters.

Use the getCharacterStream method instead.

BufferedReader br = null;
StringBuilder result = new StringBuilder();

try {
    br = new BufferedReader(clob.getCharacterStream());
    ....

answered Aug 16 '13 at 11:12

Joni

108,737
14
143
193

Well I must admit that the problem came from our workaround... but this workaround came from an Hibernate bug (https://hibernate.atlassian.net/browse/HHH-7035) ! – Stephan Aug 16 '13 at 11:42

score 0 · Answer 2 · answered Aug 16 '13 at 09:49

0

I think DB column type should support UTF-8. In Oracle this is NVARCHAR, in Postgresql it is something like this

create table test (
    utf8fld varchar(50)
);

answered Aug 16 '13 at 09:49

Evgeniy Dorofeev

133,369
30
199
275

The data comes from an Oracle database where it is stored as `NCLOB`. This data is stored in Postgresql as `character varying(100000)`. Any impact ? – Stephan Aug 16 '13 at 10:21
I mean if you have VARCHAR or CLOB column and you write a Unicode character like Ё it may be distorted, while NVARCHAR or NCLOB can store any character – Evgeniy Dorofeev Aug 16 '13 at 12:01

UTF-8 - I don't understand this byte sequence

2 Answers2