Questions tagged [avro]

Apache Avro is a data serialization framework primarily used in Apache Hadoop.

Features:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since client and server both have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
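Because a schema is plain JSON, it can be inspected with any JSON library before handing it to an Avro implementation. A minimal record schema, with illustrative record and field names, looks like this:

```python
import json

# A minimal Avro record schema; the record/field names are illustrative.
# The union ["null", "string"] with a null default makes "email" nullable.
user_schema = json.loads("""
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

print(user_schema["name"])                         # → User
print([f["name"] for f in user_schema["fields"]])  # → ['name', 'email']
```

An actual Avro library would parse the same JSON text into a schema object; the point here is only that the schema declaration itself needs nothing beyond a JSON parser.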

Comparison with other systems:

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
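The name-based resolution described above can be sketched in a few lines. This is an illustrative mock of Avro's resolution rules, not the library implementation: the reader matches writer fields by name, fills reader-only fields from their defaults, and silently drops writer-only fields.

```python
# Illustrative sketch of symbolic (name-based) schema resolution.
# Not the Avro library implementation.

def resolve(writer_record, reader_fields):
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]   # same-named field: copy
        elif "default" in field:
            resolved[name] = field["default"]      # missing field: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                                # extra writer fields are dropped

# Old (writer) datum has a "nickname" field; the new (reader) schema
# instead declares an "email" field with a null default.
written = {"name": "Ada", "nickname": "al"}
reader_fields = [
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": None},
]
print(resolve(written, reader_fields))  # → {'name': 'Ada', 'email': None}
```

Because both schemas are available at read time, no numeric field IDs are needed to line up old and new data.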

Available Languages: official implementations exist for C, C++, C#, Java, Perl, PHP, Python, and Ruby, among others.

Official Website: http://avro.apache.org/

3646 questions
204 votes · 5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: Self-describing Columnar format Language-independent In comparison to Avro, Sequence Files, RC File etc. I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats,…
Ani Menon
133 votes · 6 answers

What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

All of these provide binary serialization, RPC frameworks and IDL. I'm interested in key differences between them and characteristics (performance, ease of use, programming languages support). If you know any other similar technologies, please…
andreypopp
130 votes · 6 answers

Avro vs. Parquet

I'm planning to use one of the hadoop file format for my hadoop related project. I understand parquet is efficient for column based query and avro for full scan or when we need all the columns data! Before I proceed and choose one of the file…
Abhishek
70 votes · 4 answers

Avro field default values

I am running into some issues setting up default values for Avro fields. I have a simple schema as given below: data.avsc: { "namespace":"test", "type":"record", "name":"Data", "fields":[ { "name": "id", "type": [ "long", "null" ] }, {…
Kesh
52 votes · 3 answers

How to create schema containing list of objects using Avro?

Does anyone knows how to create Avro schema which contains list of objects of some class? I want my generated classes to look like below : class Child { String name; } class Parent { list children; } For this, I have written part of…
Shekhar
50 votes · 4 answers

Thrift, Avro, Protocolbuffers - Are they all dead?

Working on a pet project (cassandra, spark, hadoop, kafka) I need a data serialization framework. Checking out the common three frameworks - namely Thrift, Avro and Protocolbuffers - I noticed most of them seem to be dead-alive having 2 minor…
dominik
49 votes · 2 answers

Schema evolution in parquet format

Currently we are using Avro data format in production. Out of several good points using Avro, we know that it is good in schema evolution. Now we are evaluating Parquet format because of its efficiency while reading random columns. So before moving…
ToBeSparkShark
47 votes · 8 answers

Confluent Maven repository not working?

I need to use the Confluent kafka-avro-serializer Maven artifact. From the official guide I should add this repository to my Maven pom confluent http://packages.confluent.io/maven/ The problem is…
gvdm
43 votes · 2 answers

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .json file)?

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .JSON file)? In my Avro schema, I have two fields: {"name": "author", "type": ["null", "string"], "default": null}, {"name": "importance",…
Emre Sevinç
43 votes · 2 answers

How to nest records in an Avro schema?

I'm trying to get Python to parse Avro schemas such as the following... from avro import schema mySchema = """ { "name": "person", "type": "record", "fields": [ {"name": "firstname", "type": "string"}, {"name":…
Jorge Aranda
39 votes · 6 answers

Can I split an Apache Avro schema across multiple files?

I can do, { "type": "record", "name": "Foo", "fields": [ {"name": "bar", "type": { "type": "record", "name": "Bar", "fields": [ ] }} ] } and that works fine, but supposing I want…
Owen
34 votes · 5 answers

Kafka schema registry not compatible in the same topic

I'm using Kafka schema registry for producing/consuming Kafka messages, for example I have two fields they are both string type, the pseudo schema as following: {"name": "test1", "type": "string"} {"name": "test2", "type": "string"} but after…
Jack
34 votes · 2 answers

How to generate fields of type String instead of CharSequence using Avro?

I wrote one Avro schema in which some of the fields need to be of type String but Avro has generated those fields of type CharSequence. I am not able to find any way to tell Avro to make those fields of type String. I tried to use "fields":…
Shekhar
33 votes · 3 answers

Generate Avro Schema from certain Java Object

Apache Avro provides a compact, fast, binary data format, rich data structure for serialization. However, it requires user to define a schema (in JSON) for object which need to be serialized. In some case, this can not be possible (e.g: the class…
Richard Le
32 votes · 5 answers

How to fix Expected start-union. Got VALUE_NUMBER_INT when converting JSON to Avro on the command line?

I'm trying to validate a JSON file using an Avro schema and write the corresponding Avro file. First, I've defined the following Avro schema named user.avsc: {"namespace": "example.avro", "type": "record", "name": "user", "fields": [ …
Emre Sevinç