3

I have a AVRO schema which is currently in single avsc file like below. Now I want to move address record to a different common avsc file which should be referenced from many other avsc file. So Customer and address will be separate avsc files. How can I separate them and and have customer avsc file reference address avsc file. Also how would both the files can be processed using python. I am currently using fast avro in python3 to process the single avsc file but open to use any other utility in python3 or pyspark.

File name - customer_details.avsc

[
{
    "type": "record",
    "namespace": "com.company.model",
    "name": "AddressRecord",
    "fields": [
        {
            "name": "streetaddress",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "zip",
            "type": "string"
        }
    ]
},
{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer",
    "fields": [
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "phone",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
}
]
import fastavro

s1 = fastavro.schema.load_schema('customer_details.avsc')

How can split the schema in different file where address record file can be referenced from other avsc file. Then how would I process multiple avsc files using fast Avro (Python) or any other python utility?

Codegator
  • 459
  • 7
  • 28

1 Answers1

2

To do this, the schema for the AddressRecord should be in a file called com.company.model.AddressRecord.avsc with the following contents:

{
    "type": "record",
    "namespace": "com.company.model",
    "name": "AddressRecord",
    "fields": [
        {
            "name": "streetaddress",
            "type": "string"
        },
        {
            "name": "city",
            "type": "string"
        },
        {
            "name": "state",
            "type": "string"
        },
        {
            "name": "zip",
            "type": "string"
        }
    ]
}

The Customer schema doesn't necessarily need a special naming convention since it is the top level schema, but it's probably a good idea to follow the same convention. So it would be in a file named com.company.model.Customer.avsc with the following contents:

{
    "namespace": "com.company.model",
    "type": "record",
    "name": "Customer",
    "fields": [
        {
            "name": "firstname",
            "type": "string"
        },
        {
            "name": "lastname",
            "type": "string"
        },
        {
            "name": "email",
            "type": "string"
        },
        {
            "name": "phone",
            "type": "string"
        },
        {
            "name": "address",
            "type": {
                "type": "array",
                "items": "com.company.model.AddressRecord"
            }
        }
    ]
}

The files must be in the same directory.

Then you should be able to do fastavro.schema.load_schema('com.company.model.Customer.avsc')

Scott
  • 1,799
  • 10
  • 11
  • I'm using confluent_kafka_python API so requiring naming conventions and all files in same directory might be too restrictive. Looks like there is an internal `_named_schemas` dict we might be able to leverage though. – Ryan Feb 04 '21 at 21:21