Questions tagged [spark-avro]

A library for reading and writing Avro data from Spark SQL.

The GitHub page is https://github.com/databricks/spark-avro.

227 questions
19 votes, 2 answers

Avro multiple record of same type in single schema

I'd like to use the same record type in an Avro schema multiple times. Consider this schema definition: { "type": "record", "name": "OrderBook", "namespace": "my.types", "doc": "Test order update", "fields": [ { …
Daniel
  • 1,522
  • 1
  • 12
  • 25
16 votes, 1 answer

Read an unsupported mix of union types from an Avro file in Apache Spark

I'm trying to switch from reading CSV flat files to Avro files on Spark. Following https://github.com/databricks/spark-avro, I use: import com.databricks.spark.avro._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) val df =…
Zahiro Mor
  • 1,708
  • 1
  • 16
  • 30
9 votes, 3 answers

How to use spark-avro package to read avro file from spark-shell?

I'm trying to use the spark-avro package as described in the Apache Avro Data Source Guide. When I submit the following command: val df = spark.read.format("avro").load("~/foo.avro") I get an error: java.util.ServiceConfigurationError:…
sahibeast
  • 341
  • 3
  • 13
7 votes, 1 answer

How to convert a struct field in a Row to an avro record in Spark Java

I have a use case where I want to convert a struct field to an Avro record. The struct field originally maps to an Avro type. The input data is Avro files, and the struct field corresponds to a field in the input Avro records. Below is what I want to…
JBT
  • 8,498
  • 18
  • 65
  • 104
7 votes, 3 answers

Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated

Unable to send Avro-format messages to a Kafka topic from a Spark streaming application. Very little information is available online about Avro Spark streaming example code. The "to_avro" method doesn't require an Avro schema, so how will it encode to Avro…
amitwdh
  • 661
  • 2
  • 9
  • 19
7 votes, 3 answers

How to query datasets in avro format?

This works with Parquet: val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`") I tried the same approach with Avro, but it keeps giving me an error even if I use com.databricks.spark.avro. When I execute the following query: val sqlDF…
Akrem
  • 90
  • 1
  • 5
6 votes, 1 answer

Deserialize Avro Spark

I'm pushing a stream of data to Azure EventHub with the following code, which uses Microsoft.Hadoop.Avro. This code runs every 5 seconds and simply sends the same two Avro-serialised items: var strSchema = File.ReadAllText("schema.json"); var…
6 votes, 4 answers

How to create an empty dataFrame in Spark

I have a set of Avro-based Hive tables and I need to read data from them. As Spark SQL uses Hive SerDes to read the data from HDFS, it is much slower than reading HDFS directly. So I have used the Databricks spark-avro jar to read the Avro files from…
Vinay Kumar
  • 1,664
  • 2
  • 15
  • 19
6 votes, 2 answers

How to convert bytes from Kafka to their original object?

I am fetching data from Kafka and then deserializing the Array[Byte] using the default decoder. After that, my RDD elements look like (null,[B@406fa9b2), (null,[B@21a9fe0), but I want my original data, which has a schema. How can I achieve this? I…
JSR29
  • 354
  • 1
  • 5
  • 17
6 votes, 3 answers

create hive external table with schema in spark

I am using Spark 1.6 and I aim to create an external Hive table like I do in a Hive script. To do this, I first read in the partitioned Avro file and get its schema. This is where I am stuck: I have no idea how to apply this schema to my…
G_cy
  • 994
  • 3
  • 13
  • 28
5 votes, 0 answers

Invalid sync error while reading avro file using spark or hive

I have an Avro file that was created using the Java API. While the writer was writing data to the file, the program shut down ungracefully due to a machine reboot. Now when I try to read this file using Spark/Hive, it reads some data and then throws…
User_qwerty
  • 375
  • 1
  • 2
  • 10
5 votes, 1 answer

Spark from_avro() dataframe.show() errors java.lang.ArrayIndexOutOfBoundsException

I converted a DataFrame's fields to an Avro struct field using to_avro, and back using from_avro, as shown below. Ultimately I want to stream the Avro payload to Kafka for writing and reading. When I tried to print the final reconverted DataFrame with df.show(), it…
Anand K
  • 293
  • 3
  • 12
5 votes, 2 answers

Spark on Cluster: Read Large number of small avro files is taking too long to list

I know the problem of reading a large number of small files in HDFS has always been an issue and has been widely discussed, but bear with me. Most of the Stack Overflow questions dealing with this type of issue are concerned with reading a large number of txt…
ni_i_ru_sama
  • 304
  • 1
  • 13
5 votes, 2 answers

Spark 2.4.0 Avro Java - cannot resolve method from_avro

I'm trying to run a Spark stream from a Kafka queue containing Avro messages. As per https://spark.apache.org/docs/latest/sql-data-sources-avro.html, I should be able to use from_avro to convert a column value to a Dataset. However, I'm unable to…
Maciej C
  • 55
  • 3
  • 6
5 votes, 1 answer

How to convert nested avro GenericRecord to Row

I have code to convert my Avro record to a Row using the function avroToRowConverter(): directKafkaStream.foreachRDD(rdd -> { JavaRDD newRDD= rdd.map(x->{ Injection recordInjection =…
Sumit G
  • 436
  • 8
  • 21