Questions tagged [avro]

Apache Avro is a data serialization framework primarily used in Apache Hadoop.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
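Because a schema is plain JSON, it can be inspected with any JSON library before handing it to an Avro implementation. A minimal record schema, with illustrative record and field names, looks like this:

```python
import json

# A minimal Avro record schema; the record/field names are illustrative.
# The union ["null", "string"] with a null default makes "email" nullable.
user_schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

print(user_schema["name"])                         # → User
print([f["name"] for f in user_schema["fields"]])  # → ['name', 'email']
```

An actual Avro library would parse the same JSON text into a schema object; the point here is only that the schema declaration itself needs nothing beyond a JSON parser.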

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
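The name-based resolution described above can be sketched in a few lines. This is an illustrative mock of Avro's resolution rules, not the library implementation: the reader matches writer fields by name, fills reader-only fields from their defaults, and silently drops writer-only fields.

```python
# Illustrative sketch of symbolic (name-based) schema resolution.
# Not the Avro library implementation.

def resolve(writer_record, reader_fields):
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]   # same-named field: copy
        elif "default" in field:
            resolved[name] = field["default"]      # missing field: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                                # extra writer fields are dropped

# Old (writer) datum has a "nickname" field; the new (reader) schema
# instead declares an "email" field with a null default.
written = {"name": "Ada", "nickname": "al"}
reader_fields = [
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": None},
]
print(resolve(written, reader_fields))  # → {'name': 'Ada', 'email': None}
```

Because both schemas are available at read time, no numeric field IDs are needed to line up old and new data.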

Available Languages: official implementations exist for C, C++, C#, Java, Perl, PHP, Python, and Ruby, among others.

Official Website: http://avro.apache.org/

3646 questions
204 votes · 5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: Self-describing Columnar format Language-independent In comparison to Avro, Sequence Files, RC File etc. I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats,…
Ani Menon
133 votes · 6 answers

What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

All of these provide binary serialization, RPC frameworks and IDL. I'm interested in key differences between them and characteristics (performance, ease of use, programming languages support). If you know any other similar technologies, please…
andreypopp
130 votes · 6 answers

Avro vs. Parquet

I'm planning to use one of the hadoop file format for my hadoop related project. I understand parquet is efficient for column based query and avro for full scan or when we need all the columns data! Before I proceed and choose one of the file…
Abhishek
70 votes · 4 answers

Avro field default values

I am running into some issues setting up default values for Avro fields. I have a simple schema as given below: data.avsc: { "namespace":"test", "type":"record", "name":"Data", "fields":[ { "name": "id", "type": [ "long", "null" ] }, {…
Kesh
52 votes · 3 answers

How to create schema containing list of objects using Avro?

Does anyone knows how to create Avro schema which contains list of objects of some class? I want my generated classes to look like below : class Child { String name; } class Parent { list children; } For this, I have written part of…
Shekhar
50 votes · 4 answers

Thrift, Avro, Protocolbuffers - Are they all dead?

Working on a pet project (cassandra, spark, hadoop, kafka) I need a data serialization framework. Checking out the common three frameworks - namely Thrift, Avro and Protocolbuffers - I noticed most of them seem to be dead-alive having 2 minor…
dominik
49 votes · 2 answers

Schema evolution in parquet format

Currently we are using Avro data format in production. Out of several good points using Avro, we know that it is good in schema evolution. Now we are evaluating Parquet format because of its efficiency while reading random columns. So before moving…
ToBeSparkShark
47 votes · 8 answers

Confluent Maven repository not working?

I need to use the Confluent kafka-avro-serializer Maven artifact. From the official guide I should add this repository to my Maven pom confluent http://packages.confluent.io/maven/ The problem is…
gvdm
43 votes · 2 answers

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .json file)?

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .JSON file)? In my Avro schema, I have two fields: {"name": "author", "type": ["null", "string"], "default": null}, {"name": "importance",…
Emre Sevinç
43 votes · 2 answers

How to nest records in an Avro schema?

I'm trying to get Python to parse Avro schemas such as the following... from avro import schema mySchema = """ { "name": "person", "type": "record", "fields": [ {"name": "firstname", "type": "string"}, {"name":…
Jorge Aranda
39 votes · 6 answers

Can I split an Apache Avro schema across multiple files?

I can do, { "type": "record", "name": "Foo", "fields": [ {"name": "bar", "type": { "type": "record", "name": "Bar", "fields": [ ] }} ] } and that works fine, but supposing I want…
Owen
34 votes · 5 answers

Kafka schema registry not compatible in the same topic

I'm using Kafka schema registry for producing/consuming Kafka messages, for example I have two fields they are both string type, the pseudo schema as following: {"name": "test1", "type": "string"} {"name": "test2", "type": "string"} but after…
Jack
34 votes · 2 answers

How to generate fields of type String instead of CharSequence using Avro?

I wrote one Avro schema in which some of the fields need to be of type String but Avro has generated those fields of type CharSequence. I am not able to find any way to tell Avro to make those fields of type String. I tried to use "fields":…
Shekhar
33 votes · 3 answers

Generate Avro Schema from certain Java Object

Apache Avro provides a compact, fast, binary data format, rich data structure for serialization. However, it requires user to define a schema (in JSON) for object which need to be serialized. In some case, this can not be possible (e.g: the class…
Richard Le
32 votes · 5 answers

How to fix Expected start-union. Got VALUE_NUMBER_INT when converting JSON to Avro on the command line?

I'm trying to validate a JSON file using an Avro schema and write the corresponding Avro file. First, I've defined the following Avro schema named user.avsc: {"namespace": "example.avro", "type": "record", "name": "user", "fields": [ …
Emre Sevinç