
We are facing a problem reading Avro files in spark2-shell on Spark 2.4. Any pointers would be of great help.

We were using the following method to read Avro files in Spark 2.3, but this support has been removed in Spark 2.4:

spark2-shell --jars /tmp/spark/spark-avro_2.11-4.0.0.jar
import org.apache.avro.Schema
spark.sqlContext.sparkContext.hadoopConfiguration.set("avro.mapred.ignore.inputs.without.extension", "true")
val df = spark.read.format("com.databricks.spark.avro").option("header", "true").option("mode", "DROPMALFORMED").load("<DIR_PATH_FOR_AVRO>")
The Spark 2.4 documentation provides the following details:

(https://spark.apache.org/docs/latest/sql-data-sources-avro.html)

./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4

But we get the following exception when using this approach:

Exception in thread "main" java.lang.RuntimeException: 
[unresolved dependency: org.apache.spark#spark-avro_2.12;2.4.4: not found]
 at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1306)
 at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
 at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
 at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
 at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
 at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

We have also tried:

spark2-shell --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars /tmp/spark/spark-avro_2.12-2.4.0.jar
  • Tried the options provided at this link: https://stackoverflow.com/questions/55873023/how-to-use-spark-avro-package-to-read-avro-file-from-spark-shell, but was not successful – Anuj Mehra Jan 24 '20 at 14:26
  • Did you try `spark2-shell --jars /tmp/spark/spark-avro_2.12-2.4.4.jar`? – 54l3d Jan 24 '20 at 14:59
  • Are you able to access https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/2.4.4/spark-avro_2.12-2.4.4.jar from the box where you run your `spark-shell`? – mazaneicha Jan 25 '20 at 16:02
  • According to this JIRA https://issues.apache.org/jira/browse/SPARK-24768, the Avro data source is supported natively in 2.4 – mazaneicha Jan 27 '20 at 18:23
  • Getting the following error: spark.read.format("avro").load("") throws org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".; at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) ... 49 elided (see the note below these comments) – Anuj Mehra Jan 28 '20 at 10:56
  • Tried both of the options: 1. spark2-shell 2. spark2-shell --jars /tmp/spark-avro_2.12-2.4.4.jar – Anuj Mehra Jan 28 '20 at 10:56
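
For context, the "built-in but external data source module" message quoted above means that since Spark 2.4 the avro format ships with Spark itself, but as a separate module that must be placed on the classpath via --jars or --packages. A quick diagnostic sketch (the provider class name below is taken from the Spark 2.4 codebase; adjust if your build differs):

// Run inside spark2-shell: if this throws ClassNotFoundException, the
// spark-avro module is not on the classpath and format("avro") will fail.
Class.forName("org.apache.spark.sql.avro.AvroFileFormat")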

2 Answers


The "Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.spark#spark-avro_2.12;2.4.4: not found] ..." seems like an issue accessing central maven repo at https://repo1.maven.org/maven2/, probably because your environment is using a proxy.

So I think you're on the right path: you can manually download the spark-avro_2.1x-2.4.x.jar from https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.xx/2.4.x/, transfer it to your node, and use spark2-shell --jars spark-avro_2.xx-2.4.x.jar to start the REPL shell.

Looks like you're using the Cloudera distro of Spark 2.4. Its latest maintenance version is 2.4.2, and it is still based on Scala 2.11, so I think you're looking for the jar spark-avro_2.11-2.4.2.jar.
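
A minimal sketch of that workflow, assuming the standard Maven Central directory layout shown in the comments (you can confirm which Scala version your build uses from the spark2-shell welcome banner, or with util.Properties.versionString in the REPL):

wget https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.11/2.4.2/spark-avro_2.11-2.4.2.jar
spark2-shell --jars spark-avro_2.11-2.4.2.jar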

With that jar, things seem to work okay for me:

$ spark2-shell --jars ~/.m2/repository/org/apache/spark/spark-avro_2.11/2.4.2/spark-avro_2.11-2.4.2.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xxxxxxx.xxxnet:4056
Spark context available as 'sc' (master = yarn, app id = application_xxxxxxxxxxxxx_xxxxx).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0.cloudera2
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.format("avro").load("/some/hdfs/path/kilo_sample.avro")
df: org.apache.spark.sql.DataFrame = [registration_dttm: string, id: bigint ... 11 more fields]

scala> df.show(false)
+--------------------+---+----------+---------+------------------------+------+---------------+-------------------+----------------------+----------+---------+----------------------------+----------------------------+
|registration_dttm   |id |first_name|last_name|email                   |gender|ip_address     |cc                 |country               |birthdate |salary   |title                       |comments                    |
+--------------------+---+----------+---------+------------------------+------+---------------+-------------------+----------------------+----------+---------+----------------------------+----------------------------+
|2016-02-03T07:55:29Z|1  |Amanda    |Jordan   |ajordan0@com.com        |Female|1.197.201.2    |6759521864920116   |Indonesia             |3/8/1971  |49756.53 |Internal Auditor            |1E+02                       |
|2016-02-03T17:04:03Z|2  |Albert    |Freeman  |afreeman1@is.gd         |Male  |218.111.175.34 |null               |Canada                |1/16/1968 |150280.17|Accountant IV               |                            |
...
|2016-02-03T10:30:36Z|20 |Rebecca   |Bell     |rbellj@bandcamp.com     |Female|172.215.104.127|null               |China                 |          |137251.19|                            |                            |
+--------------------+---+----------+---------+------------------------+------+---------------+-------------------+----------------------+----------+---------+----------------------------+----------------------------+
only showing top 20 rows

scala>

If you still have trouble after trying this version, please update your question with the complete stacktrace so we can see exactly what the problem is.

mazaneicha
  • Thanks a lot, this worked for me. We are using Spark version 2.4.0.cloudera2; the compatible jar is spark-avro_2.11-2.4.0-cdh6.1.0.jar. – Anuj Mehra Jan 28 '20 at 18:12
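A PySpark variant: this example targets Spark 3.1.1 built with Scala 2.12 and points spark.jars at a locally downloaded spark-avro jar.
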
from pyspark.sql import SparkSession

# Build a session with the spark-avro jar on the classpath; a raw string
# keeps the Windows backslashes from being read as escape sequences.
spark = SparkSession \
    .builder \
    .config("spark.jars", r"C:\spark\spark-avro_2.12-3.1.1.jar") \
    .getOrCreate()

# Since Spark 2.4 the built-in short name for the format is simply "avro".
df = spark.read.format("avro").load("hdfs:///user/pragmatyca/cliente.avro")

df.show(20, False)
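
Note that spark.jars expects the jar to already exist at that local path. If the machine can reach Maven Central, setting spark.jars.packages to org.apache.spark:spark-avro_2.12:3.1.1 (or launching with --packages) lets Spark resolve the dependency automatically instead.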
buzatto