1

The following code is used to serialize the data.

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        BinaryEncoder binaryEncoder =
            EncoderFactory.get().binaryEncoder(byteArrayOutputStream, null);

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(data.getSchema());
        datumWriter.write(data, binaryEncoder);

        binaryEncoder.flush();
        byteArrayOutputStream.close();

        byte[] result = byteArrayOutputStream.toByteArray();

I used the following command

FileUtils.writeByteArrayToFile(new File("D:/sample.avro"), data);

to write the Avro byte array to a file. But when I try to read it back using

        File file = new File("D:/sample.avro");
        try {
          dataFileReader = new DataFileReader(file, datumReader);
        } catch (IOException exp) {
          System.out.println(exp);
          System.exit(1);
        }

it throws an exception:

java.io.IOException: Not a data file.
    at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
    at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:89)

What is the problem here? I referred to two other similar Stack Overflow questions, this and this, but they haven't been of much help to me. Can someone help me understand this?

pacman
  • 725
  • 1
  • 9
  • 28
  • File extensions don't matter, so back your problem up to how you actually write the data – OneCricketeer Mar 20 '21 at 14:41
  • @OneCricketeer I have added how I created the serialized data. – pacman Mar 20 '21 at 18:39
  • Shouldn't you write `result` to the file? Also, why not use DataFileWriter class? – OneCricketeer Mar 21 '21 at 15:05
  • @OneCricketeer I don't really need to write to a file; it's just an effort to see how the data looks. The `result` is sent to a Kafka topic. From what I understand, the schema is not being sent by the first lines of code; only the data is sent. That's why saving the data in a file after reading it from the Kafka topic isn't working, that's my guess. So `DataFileReader` is not what's effectively causing the problem; it's the missing schema in the serialized data sent to Kafka. What I'm not understanding is why the schema is not there in `result`, which is what's being sent to the Kafka topic. – pacman Mar 21 '21 at 15:52
  • The data looks like `dataasdadasd` after reading, with a few random characters in between. So, analysing this, I believe only the data is sent to the topic and not the schema. – pacman Mar 21 '21 at 15:54
  • The schema is there with a GenericDatumWriter, though. But if you want an Avro file, it's best to use the classes meant to do so. You cannot open a binary file and expect to see plaintext. My point was that `result` is an Avro byte array, so you should write it with `writeByteArrayToFile`, not `data`, which seems to be an Avro object already. – OneCricketeer Mar 22 '21 at 14:30

2 Answers

1

The actual data is encoded in the Avro binary format, but typically what's passed around is more than just the encoded data.

What most people think of as an "avro file" is a format that includes a header (which contains things like the writer schema) followed by the actual data: https://avro.apache.org/docs/current/spec.html#Object+Container+Files. The first four bytes of an avro file should be the ASCII characters "Obj" followed by the version byte 1, i.e. 0x4F 0x62 0x6A 0x01. The error you are getting is because the binary you are trying to read as a data file doesn't start with these standard magic bytes.

Another standard format is the single object encoding: https://avro.apache.org/docs/current/spec.html#single_object_encoding. This type of binary format should start with the two bytes 0xC3 0x01.
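You can tell these layouts apart by inspecting the first bytes of the payload. Here's a minimal sketch in plain Java (the class and method names are illustrative, not part of the Avro API):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AvroHeaderCheck {
    // Object container files start with ASCII "Obj" plus the version byte 1.
    static final byte[] CONTAINER_MAGIC = {0x4F, 0x62, 0x6A, 0x01};
    // Single object encoding starts with the two-byte marker 0xC3 0x01.
    static final byte[] SINGLE_OBJECT_MAGIC = {(byte) 0xC3, 0x01};

    static boolean startsWith(byte[] bytes, byte[] prefix) {
        return bytes.length >= prefix.length
            && Arrays.equals(Arrays.copyOfRange(bytes, 0, prefix.length), prefix);
    }

    public static void main(String[] args) {
        // "Obj\u0001" encodes to 0x4F 0x62 0x6A 0x01 in ISO-8859-1.
        byte[] containerLike = "Obj\u0001...rest of file".getBytes(StandardCharsets.ISO_8859_1);
        byte[] rawEncoded = "dataasdadasd".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(startsWith(containerLike, CONTAINER_MAGIC)); // true
        System.out.println(startsWith(rawEncoded, CONTAINER_MAGIC));    // false
        System.out.println(startsWith(rawEncoded, SINGLE_OBJECT_MAGIC)); // false
    }
}
```

If neither marker is present (as in your case), the bytes are most likely the raw binary encoding with no header at all.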

But if I had to guess, the binary you have could just be the raw serialized data without any sort of header information. Though it's hard to know for sure without knowing how the byte array that you have was created.

Scott
  • 1,799
  • 10
  • 11
  • Actually the data comes from a Kafka topic, and I'm trying to deserialize it. The way I put it on the Kafka topic is as a generic record after serializing it, so the serialized data should contain the Avro schema as well as the data. I'm trying to deserialize the data without supplying any schema, since the incoming serialized data should contain the schema, and use that. I'm trying the approach from http://apache-avro.679487.n3.nabble.com/Deserialize-Avro-Object-Without-Schema-td4031983.html. So I saved the incoming serialized data from the topic in a file and then tried this approach. – pacman Mar 17 '21 at 03:41
  • Also, the serialized data looks like `dataasdadasd` in the topic, with a few square-like symbols in between the letters. Do you have any thoughts on this? Is the serialization really not working? – pacman Mar 17 '21 at 03:43
  • 1
    Are you using something like the confluent schema registry library to put the message on Kafka? If so, those messages do not include the full schema. Instead they are more like the single object encoding where it is the serialized data with a small header containing a schema ID rather than the full schema. If instead you are using the DataFileWriter to create an actual avro file and then sending that to Kafka yourself without any schema registry library, then what you have should probably work so something else must be going on... – Scott Mar 17 '21 at 11:48
0

You'd need to use Avro's file classes to write the data as well as to read it; otherwise the schema and header aren't written (hence the "Not a data file" message). (See: https://cwiki.apache.org/confluence/display/AVRO/FAQ#FAQ-HowcanIserializedirectlyto/fromabytearray?)
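For example, a minimal round trip with `DataFileWriter`/`DataFileReader` might look like the sketch below (assuming the Avro library is on the classpath; the schema and file name here are made up for illustration):

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroFileRoundTrip {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema, just for this example.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Sample\",\"fields\":"
          + "[{\"name\":\"value\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("value", "hello");

        File file = new File("sample.avro");

        // DataFileWriter writes the container-file header (magic bytes + schema)
        // before the encoded records, which is what DataFileReader expects.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(record);
        }

        // No schema needs to be supplied here; it is read from the file header.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            System.out.println(reader.next().get("value"));
        }
    }
}
```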

If you're just looking to serialize an object, see: https://mkyong.com/java/how-to-read-and-write-java-object-to-a-file/

Curtis
  • 548
  • 3
  • 13