I'd like to reuse the same record type in an Avro schema multiple times. Consider this schema definition:
{
  "type": "record",
  "name": "OrderBook",
  "namespace": "my.types",
  "doc": "Test order update",
  "fields": [
    {
      …
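For reference, Avro resolves this by letting you define a named record once and then reference it by name anywhere later in the same schema. A minimal Scala sketch parsing such a schema with the Avro Java API (the PriceLevel record and its fields are assumptions, since the schema above is cut off):

import org.apache.avro.Schema

// define the nested record once under "bids", then reuse it by name under "asks"
val schemaJson =
  """{
    |  "type": "record",
    |  "name": "OrderBook",
    |  "namespace": "my.types",
    |  "fields": [
    |    {"name": "bids", "type": {"type": "array", "items": {
    |      "type": "record", "name": "PriceLevel",
    |      "fields": [{"name": "price", "type": "double"},
    |                 {"name": "size",  "type": "long"}]}}},
    |    {"name": "asks", "type": {"type": "array", "items": "my.types.PriceLevel"}}
    |  ]
    |}""".stripMargin

val schema = new Schema.Parser().parse(schemaJson) // throws if the name reference can't be resolved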
I'm trying to switch from reading CSV flat files to Avro files in Spark.
Following https://github.com/databricks/spark-avro, I use:
import com.databricks.spark.avro._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df =…
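For what it's worth, with that import in scope the Databricks package adds an avro convenience method to DataFrameReader. A sketch of the read side, continuing from the sqlContext above (the path is an assumption):

import com.databricks.spark.avro._

// the implicits from this import add .avro(...) to DataFrameReader
val df = sqlContext.read.avro("/data/input/episodes.avro") // hypothetical path
df.printSchema()
df.show(5)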
I'm trying to use the spark-avro package as described in the Apache Avro Data Source Guide.
When I submit the following command:
val df = spark.read.format("avro").load("~/foo.avro")
I get an error:
java.util.ServiceConfigurationError:…
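In my experience a ServiceConfigurationError here usually means the spark-avro artifact on the classpath was built for a different Scala or Spark version than the running cluster. A hedged sketch, assuming a Spark 2.4.3 / Scala 2.11 build; note also that Hadoop paths do not expand "~":

// launch with an artifact matching your build (coordinates are an assumption):
//   spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.3
val df = spark.read.format("avro").load("/home/me/foo.avro") // absolute path instead of "~" (path is an assumption)
df.show()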
I have a use case where I want to convert a struct field to an Avro record. The struct field originally maps to an Avro type. The input data is Avro files, and the struct field corresponds to a field in the input Avro records.
Below is what I want to…
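The snippet above is cut off, but for what it's worth the to_avro column function does exactly this conversion: it encodes a struct column to Avro binary. A minimal sketch, assuming Spark 3.x and a struct column named event (both assumptions):

import org.apache.spark.sql.avro.functions.to_avro // Spark 2.4: org.apache.spark.sql.avro.to_avro
import org.apache.spark.sql.functions.col

// each row of "value" is one Avro record whose schema mirrors the struct's type
val encoded = df.select(to_avro(col("event")).as("value"))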
I am unable to send Avro-format messages to a Kafka topic from a Spark Streaming application. Very little information is available online about Avro Spark Streaming example code. Since the "to_avro" method doesn't require an Avro schema, how will it encode to Avro…
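As I understand it, to_avro needs no schema argument because it derives the Avro schema from the column's Spark SQL (Catalyst) type. A hedged write-side sketch, assuming Spark 3.x (in 2.4 the import is org.apache.spark.sql.avro.to_avro) and hypothetical broker/topic names:

import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, struct}

// pack the whole row into one struct, encode it, and ship it as the Kafka value
val out = df.select(to_avro(struct(df.columns.map(col): _*)).as("value"))
out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumption
  .option("topic", "avro-events")                      // assumption
  .option("checkpointLocation", "/tmp/avro-ckpt")      // required by streaming sinks
  .start()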
This works with Parquet:
val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path`")
I tried the same approach with Avro, but it keeps giving me an error even if I use com.databricks.spark.avro.
When I execute the following query:
val sqlDF…
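For comparison, the same path-qualified query does work for Avro once an avro data source is actually on the classpath. A sketch assuming the Spark 2.4+ built-in module and a hypothetical path:

// requires e.g. --packages org.apache.spark:spark-avro_2.11:2.4.3 at launch
val sqlDF = spark.sql("SELECT DISTINCT field FROM avro.`/data/input/file.avro`")
sqlDF.show()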
I'm pushing a stream of data to Azure EventHub with the following code, leveraging Microsoft.Hadoop.Avro. This code runs every 5 seconds and simply writes the same two Avro-serialised items:
var strSchema = File.ReadAllText("schema.json");
var…
I have a set of Avro-based Hive tables, and I need to read data from them. Since Spark SQL uses the Hive serdes to read the data from HDFS, it is much slower than reading HDFS directly. So I have used the Databricks spark-avro jar to read the Avro files from…
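A sketch of that direct read, pointing spark-avro at the table's warehouse location and registering a temp view for SQL (Spark 2.x assumed; paths and names are assumptions):

val df = spark.read
  .format("com.databricks.spark.avro")
  .load("hdfs:///user/hive/warehouse/mydb.db/mytable") // the table's storage location
df.createOrReplaceTempView("mytable_direct") // query this instead of the Hive serde path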
I am fetching data from Kafka and then deserialising the Array[Byte] with the default decoder. After that, my RDD elements look like (null,[B@406fa9b2), (null,[B@21a9fe0), but I want my original data, which has a schema. How can I achieve this?
I…
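One way I've seen this handled is Twitter's bijection-avro: build an Injection from the writer schema and invert the byte arrays back into GenericRecords. A sketch, assuming schemaJson holds the writer schema as a JSON string and rawRdd is the (null, Array[Byte]) RDD from the question:

import com.twitter.bijection.Injection
import com.twitter.bijection.avro.GenericAvroCodecs
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord

val records = rawRdd.mapPartitions { iter =>
  // build the (non-serializable) injection once per partition, not on the driver
  val schema = new Schema.Parser().parse(schemaJson)
  val injection: Injection[GenericRecord, Array[Byte]] =
    GenericAvroCodecs.toBinary[GenericRecord](schema)
  iter.map { case (_, bytes) => injection.invert(bytes).get }
}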
I am using Spark 1.6 and I aim to create an external Hive table, as I would in a Hive script. To do this, I first read in the partitioned Avro file and get its schema. Now I am stuck: I have no idea how to apply this schema to my…
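A Spark 1.6 sketch of one way to finish this: let spark-avro infer the schema, then register an external table over the same path (assumes sqlContext is a HiveContext; table name and path are assumptions):

val df = sqlContext.read.format("com.databricks.spark.avro").load("/data/events/")
df.printSchema() // the schema inferred from the Avro files, if you want to inspect it

// register an external table in the metastore, backed by the same Avro files
sqlContext.createExternalTable("events_ext", "com.databricks.spark.avro",
  Map("path" -> "/data/events/"))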
I have an Avro file created using the Java API. While the writer was writing data to the file, the program shut down ungracefully due to a machine reboot.
Now when I try to read this file using Spark/Hive, it reads some data and then throws…
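If the tail of the file is simply truncated, one hedged workaround (Spark 2.1+) is to let Spark keep whatever rows are readable instead of failing the whole job:

// rows already read before the corrupt block are returned; the unreadable tail is skipped
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.format("avro").load("/data/partial.avro") // path is an assumption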
I converted a DataFrame's fields to an Avro struct using to_avro, and back using from_avro, as below. Ultimately I want to stream the Avro payload to Kafka for writing/reading.
When I try to print the final reconverted DataFrame with df.show(), it…
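For reference, a round-trip sketch of that conversion under Spark 3.x (column names and the schema string are assumptions; the schema must match what to_avro produced):

import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

val jsonFormatSchema: String = ??? // Avro schema JSON matching the struct (assumption)

val avroDf = df.select(to_avro(struct(col("id"), col("name"))).as("value"))
val back = avroDf
  .select(from_avro(col("value"), jsonFormatSchema).as("data"))
  .select("data.*") // flatten the struct back into top-level columns
back.show()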
I know the problem of reading a large number of small files in HDFS has always been an issue and has been widely discussed, but bear with me. Most Stack Overflow questions dealing with this type of issue concern reading a large number of txt…
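That said, the usual remedy is the same for Avro as for text: compact the small files by reading them once and rewriting a handful of larger ones. A sketch with assumed paths and partition count:

val df = spark.read.format("avro").load("/data/small-files/") // directory of many small Avro files
df.repartition(16).write.format("avro").save("/data/compacted/") // 16 larger files instead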
I'm trying to run a Spark stream from a Kafka queue containing Avro messages.
As per https://spark.apache.org/docs/latest/sql-data-sources-avro.html, I should be able to use from_avro to convert the column value to a Dataset.
However, I'm unable to…
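For comparison, the pattern from that page looks roughly like this (Spark 3.x import shown; broker, topic, and schema file are assumptions). One gotcha: from_avro expects plain Avro binary, so Confluent-framed messages with the 5-byte header will not decode:

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("user.avsc")))

val users = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "users")
  .load()
  .select(from_avro(col("value"), jsonFormatSchema).as("user"))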
I have code to convert my Avro record to a Row using the function avroToRowConverter():
directKafkaStream.foreachRDD(rdd -> {
JavaRDD newRDD = rdd.map(x -> {
Injection recordInjection =…
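In case it's useful, here is a hedged Scala sketch of what such a converter can look like: copy the Avro field values into a Row in schema order, normalising Utf8 strings (avroToRowConverter itself isn't shown in the question, so this is an assumption about its shape):

import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

def avroToRowConverter(record: GenericRecord): Row =
  Row.fromSeq(record.getSchema.getFields.asScala.map { f =>
    record.get(f.name) match {
      case s: org.apache.avro.util.Utf8 => s.toString // Avro strings arrive as Utf8
      case v => v // sketch only: nested records/arrays would need their own handling
    }
  })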