Avro write and read works on one machine but not on another

Question

The following Avro code runs on one machine but fails with an exception on another.

We have not been able to work out what is wrong.

Here is the code that causes the problem:

Class<?> clazz = obj.getClass();
ReflectData rdata = ReflectData.AllowNull.get();
Schema schema = rdata.getSchema(clazz);

ByteArrayOutputStream os = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().binaryEncoder(os, null);
DatumWriter<T> writer = new ReflectDatumWriter<T>(schema, rdata);
writer.write(obj, encoder);
encoder.flush();
byte[] bytes = os.toByteArray();

String binaryString = new String (bytes, "ISO-8859-1");
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(binaryString.getBytes("ISO-8859-1"), null);
GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (schema);
GenericRecord record = datumReader.read(null, decoder);

The exception is:

org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -32
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ValidatingDecoder.readString(ValidatingDecoder.java:107)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:437)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:427)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:189)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:187)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:263)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:216)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:173)

Please post the exception including its stack trace. – K Erlandsson May 12 '15 at 17:58
@KristofferE, I have added the exception. – user2250246 May 12 '15 at 18:09

score 1 · Answer 1 · answered May 12 '15 at 18:29

Adding -Dfile.encoding=UTF-8 to the Tomcat JVM parameters resolved the issue for us.
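
One reading that is consistent with the stack trace, though not confirmed anywhere in this thread: Avro stores a string's length as a zigzag-encoded varint, and the single byte 0x3F (the character '?') decodes to exactly -32. Java substitutes '?' whenever a String is encoded with a charset that cannot represent a character, so a step that re-encodes the intermediate string with the platform default charset could turn a length byte into '?' on one machine and not on another. The sketch below illustrates that arithmetic with the standard library only. The class name `DefaultCharsetDemo` and its helper methods are hypothetical, and US-ASCII stands in for whatever default charset the failing machine actually had.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the code in the question.
final class DefaultCharsetDemo {

    // Zigzag decoding as described in the Avro specification for int/long values.
    static long decodeZigZag(long n) {
        return (n >>> 1) ^ -(n & 1);
    }

    // Encode with a charset that cannot represent every character.
    // String.getBytes(Charset) replaces unmappable characters with '?' (0x3F).
    static byte[] encodeAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // This value differs between machines unless -Dfile.encoding is set.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // U+00E0 is what byte 0xE0 becomes when decoded as ISO-8859-1.
        byte[] lossy = encodeAscii("\u00e0");
        System.out.println(lossy[0]);               // 63, i.e. '?'
        System.out.println(decodeZigZag(lossy[0])); // -32
    }
}
```

Printing `Charset.defaultCharset()` on both machines is a quick way to check whether they actually differ before relying on the JVM flag.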

user2250246

3,807
5
43
71

I'd be cautious with that explicit use of ISO-8859-1. I believe Avro writes strings as UTF-8 bytes, and 8859-1 does not match UTF-8 exactly. You may encounter characters that break your code because you are interpreting the bytes as 8859-1 before going back to bytes, which could corrupt data. – Keegan May 13 '15 at 03:09
ISO-8859-1 is used for binary encodings of numbers that could not be handled by UTF-8 byte to string conversion in the following code: new String (os.toByteArray(), "ISO-8859-1"); Do you see issues with this conversion? – user2250246 May 13 '15 at 04:44
I'll have to dig some more into Avro to double check what encoding it writes it with. It'll be a problem if it's not 8859-1. Why are you converting it to a string in the first place? – Keegan May 13 '15 at 05:03
Avro's strings are UTF-8 bytes. You can see this in `org.apache.avro.io.BinaryEncoder.writeString(String)`. I don't think reading Avro output as 8859-1 text is right. Avro is a binary format with its own serialization (you can see how it encodes in `org.apache.avro.io.BinaryData`). The only times it uses 8859-1 are when reading from JSON or URLs. I strongly recommend staying in bytes and not creating an intermediate string. If you need a string for debugging, convert the Avro record to JSON, or use the toString() on your avsc object (if applicable). – Keegan May 13 '15 at 15:15
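
To make the concern in these comments concrete, here is a minimal sketch using only the standard library. It pushes every possible byte value through a String and back. With ISO-8859-1 on both sides the bytes survive, because that charset maps each byte to exactly one character; with UTF-8 they do not, because many byte sequences are malformed and get replaced. The class name `RoundTripDemo` and its methods are hypothetical and are not taken from the code in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the code in the question.
final class RoundTripDemo {

    // Decode bytes into a String and encode them again with the same charset.
    static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    // Every byte value from 0x00 to 0xFF, standing in for arbitrary binary output.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // ISO-8859-1 is a one-to-one byte/char mapping, so nothing is lost.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1))); // true
        // UTF-8 replaces malformed sequences, so the bytes change.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));      // false
    }
}
```

The ISO-8859-1 round trip is only safe while every step that touches the string uses that same explicit charset. Passing the `byte[]` straight to the decoder, as recommended above, removes that dependency altogether.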
