
I'm trying to validate a JSON file using an Avro schema and write the corresponding Avro file. First, I've defined the following Avro schema named user.avsc:

{"namespace": "example.avro",
 "type": "record",
 "name": "user",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

Then created a user.json file:

{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}

And then tried to run:

java -jar ~/bin/avro-tools-1.7.7.jar fromjson --schema-file user.avsc user.json > user.avro

But I get the following exception:

Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

Am I missing something? Why do I get "Expected start-union. Got VALUE_NUMBER_INT"?

Emre Sevinç
    We encountered the same problem and are now using [avro-json-decoder](https://github.com/Celos/avro-json-decoder), a standalone version of [zolyfarkas' org.apache.avro.io.ExtendedJsonDecoder](https://github.com/zolyfarkas/avro), with the [following patch](https://github.com/Celos/avro-json-decoder/pull/2). – Jonah H. Harris Nov 29 '18 at 21:44

5 Answers


According to the explanation by Doug Cutting,

Avro's JSON encoding requires that non-null union values be tagged with their intended type. This is because unions like ["bytes","string"] and ["int","long"] are ambiguous in JSON, the first are both encoded as JSON strings, while the second are both encoded as JSON numbers.

http://avro.apache.org/docs/current/spec.html#json_encoding

Thus your record must be encoded as:

{"name": "Alyssa", "favorite_number": {"int": 7}, "favorite_color": null}
Emre Sevinç

There is a new JSON encoder in the works that should address this common issue:

https://issues.apache.org/jira/browse/AVRO-1582

https://github.com/zolyfarkas/avro

ppearcy

As @Emre-Sevinc has pointed out, the issue is with the encoding of your Avro record.

To be more specific:

Don't do this:

   jsonRecord = avroGenericRecord.toString

Instead, do this:

    val writer = new GenericDatumWriter[GenericRecord](avroSchema)
    val baos = new ByteArrayOutputStream
    val jsonEncoder = EncoderFactory.get.jsonEncoder(avroSchema, baos)
    writer.write(avroGenericRecord, jsonEncoder)
    jsonEncoder.flush

    val jsonRecord = baos.toString("UTF-8")

You'll also need the following imports:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

After you do this, you'll get jsonRecord with non-null union values tagged with their intended type.

Hope this helps!

Abhinandan Dubey
  • Thank you very much. This is what I was missing. I was about to write it with Jackson, but then I looked at your example, and it worked like a charm :) – Hasasn Mar 09 '23 at 16:12

I have implemented union handling and validation. Just create a union schema and pass its values through Postman. registryUrl is the schema registry URL that you specify in your Kafka properties. You can also pass dynamic values to your schema.

RestTemplate template = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
HttpEntity<String> entity = new HttpEntity<String>(headers);
// Fetch the registered schema for the given subject (topic) and version
ResponseEntity<String> response = template.exchange(registryUrl + "/subjects/" + topic + "/versions/" + version, HttpMethod.GET, entity, String.class);
String responseData = response.getBody();
JSONObject jsonObject = new JSONObject(responseData);
JSONObject jsonObjectResult = new JSONObject(jsonResult);
String getData = jsonObject.get("schema").toString();
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getData);
// Copy each schema field from the incoming JSON into a generic record
GenericRecord genericRecord = new GenericData.Record(schema);
schema.getFields().stream().forEach(field -> {
    genericRecord.put(field.name(), jsonObjectResult.get(field.name()));
});
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
// true if the record conforms to the schema
boolean data = reader.getData().validate(schema, genericRecord);
Tanmay Naik
  • Note: `schema-registry-client` could be included as a dependency to call `getSchemaByVersion` – OneCricketeer Oct 18 '19 at 04:28
  • Is there any other way for creating a generic record ? When I try to send this generic record to kafka it fails on java.lang.ClassCastException: org.json.JSONObject cannot be cast to org.apache.avro.generic.IndexedRecord – Sucheth Shivakumar Nov 14 '22 at 23:11
  • @SuchethShivakumar Then something might be missing in your code. You need to register the schema first. – Tanmay Naik Nov 15 '22 at 05:34

To expand on Emre Sevinç's answer:

A more complex case of union:

schema:

{
    "type": "record",
    "name": "Type1",
    "fields": [{
        "name": "field1",
        "type": ["null", {
            "type": "record",
            "name": "Type2",
            "fields": [{
                "name": "field2",
                "type": ["null", "string"]
            }]
        }]
    }]
}

does not validate:

{
    "field1": {
        "field2": "somestring"
    }
}

does not validate:

{
    "field1": {
        "field2": {"string":"somestring"}
    }
}

validates:

{
    "field1": {
        "Type2": {"field2": {"string":"somestring"}}
    }
}
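
The pattern in the validating document is that every non-null union value gets its own tag, at each nesting level: the inner string is tagged "string", and the record sitting in the outer union is tagged with the record's name. A minimal plain-Java sketch of that composition follows; `NestedUnionExample`, `tag` and `encodeType1` are hypothetical names for illustration only, and the sketch assumes the string value needs no JSON escaping.

```java
// Illustrative only: not part of Avro. Builds the tagged JSON for the Type1 schema above.
class NestedUnionExample {

    // Wraps an already-encoded JSON value in its union branch tag.
    static String tag(String branch, String json) {
        return "{\"" + branch + "\": " + json + "}";
    }

    static String encodeType1(String field2) {
        // Inner union ["null", "string"]: null stays bare, a string gets the "string" tag.
        // Assumption: field2 contains no characters that need JSON escaping.
        String field2Json = (field2 == null) ? "null" : tag("string", "\"" + field2 + "\"");

        // The Type2 record itself is an ordinary JSON object.
        String type2Json = "{\"field2\": " + field2Json + "}";

        // Outer union ["null", Type2]: the record branch is tagged with the record name.
        return "{\"field1\": " + tag("Type2", type2Json) + "}";
    }

    public static void main(String[] args) {
        // prints {"field1": {"Type2": {"field2": {"string": "somestring"}}}}
        System.out.println(encodeType1("somestring"));
    }
}
```

Note that the schema here declares no namespace. With a namespaced record, the tag is expected to be the record's full name (namespace plus name) rather than the bare "Type2".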

Marinos An