
I am unable to import `from_avro` in PySpark.

I am trying to run a spark-submit job that pulls in the external Avro package.

Eg:

spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.1 test1.py

My test1.py file contains the import statement:

from pyspark.sql.avro.functions import from_avro, to_avro

Getting:

ImportError: No module named avro.functions

How can I import `from_avro` in Python code?

Azhar Khan
Ana

2 Answers


https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.avro.functions.from_avro.html?highlight=avro

As you can see in the documentation linked above, `pyspark.sql.avro.functions.from_avro` has been available since Spark 3.0.0.

Which Spark version are you using? If it is below 3.0.0, the import will not work.
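
As a quick sanity check you can gate the import on the runtime version. The helper below is a hypothetical illustration in pure Python (no Spark needed), based on the fact that `pyspark.sql.avro.functions` first appeared in 3.0.0:

```python
def has_avro_functions(spark_version: str) -> bool:
    # pyspark.sql.avro.functions was introduced in Spark 3.0.0,
    # so any major version below 3 lacks the module.
    major = int(spark_version.split(".")[0])
    return major >= 3

print(has_avro_functions("3.0.1"))  # True
print(has_avro_functions("2.4.7"))  # False
```

In practice you would pass `pyspark.__version__` to such a check.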

Matt

If you are actually running Spark 2.4 rather than the 3.0.1 indicated by the package coordinate, you need to write a wrapper yourself, because in Spark 2.4 spark-avro exposes `from_avro`/`to_avro` only for Java/Scala. Follow the instructions in this answer:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column


def from_avro(col, jsonFormatSchema):
    """Decode an Avro binary column using the given JSON-formatted schema."""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))


def to_avro(col):
    """Encode a column as Avro binary."""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").to_avro
    return Column(f(_to_java_column(col)))
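
For reference, the `jsonFormatSchema` argument is simply an Avro schema serialized as a JSON string. The record and field names below are made up for illustration:

```python
import json

# A hypothetical Avro record schema; the names are illustrative only.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}
jsonFormatSchema = json.dumps(schema)
print(jsonFormatSchema)
```

You would then call `from_avro(df["value"], jsonFormatSchema)` on a DataFrame column containing Avro-encoded bytes.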

Make sure the spark-avro dependency you pass to `--packages` specifies the version matching your Spark installation.

If my assumption that you are running a Spark version below 3 is incorrect, please provide more details.

oskarryn