
I'm quite new to Spark. I've imported the pyspark library into my PyCharm venv and written the code below:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", 5)
path = "file_path"
df = spark.read.format("avro").load(path)

Everything seems to be okay, but when I try to read the avro file I get this message:

pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

When I go to this page: https://spark.apache.org/docs/latest/sql-data-sources-avro.html there is a deployment section like this:

[screenshot of the deployment instructions]

and I have no idea how to implement this. Do I need to download something in PyCharm, or find external files to modify?

Thank you for help!

Update (2019-12-06): Because I'm using Anaconda, I opened the Anaconda prompt and ran this command:

pyspark --packages com.databricks:spark-avro_2.11:4.0.0

It downloaded some modules, but when I went back to PyCharm the same error appeared.

cincin21
  • Does this answer your question? [How to add third party java jars for use in pyspark](https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark) – Oliver W. Dec 05 '19 at 08:56
  • It makes some sense but I don't have any *.jar files, should I download something? – cincin21 Dec 05 '19 at 09:21
  • That would be an option, yes. You can find those packages typically on [Maven Central](https://search.maven.org/). – Oliver W. Dec 05 '19 at 09:24
  • @cincin21 can you use `--packages` directly and run with spark-submit? – Mahesh Gupta Dec 05 '19 at 09:38
  • can you try launching your pyspark shell like this and try running the code again `pyspark --packages com.databricks:spark-avro_2.11:4.0.0` – Prabhakar Reddy Dec 05 '19 at 09:54
  • ok, so because I'm using anaconda, I've copied this code to anaconda prompt and it downloaded some modules, then I've got back to PyCharm and same error appears – cincin21 Dec 05 '19 at 13:58
  • Check the pycharm environment whether you're using the same environment or different than the one, you have downloaded the package into. – Shubhanshu Dec 09 '19 at 12:19
  • I've checked it and everything was ok – cincin21 Dec 10 '19 at 09:31

4 Answers


I downloaded the pyspark 2.4.4 package from conda in PyCharm, added the spark-avro_2.11-2.4.4.jar file to the Spark configuration, and was able to successfully recreate your error, i.e. pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'

To fix this issue, follow the steps below:

  1. Uninstall the pyspark package downloaded from conda.
  2. Download and unzip spark-2.4.4-bin-hadoop2.7.tgz from here.
  3. In Run > Environment Variables, set SPARK_HOME to <download_path>/spark-2.4.4-bin-hadoop2.7 and set PYTHONPATH to $SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python
  4. Download the spark-avro_2.11-2.4.4.jar file from here.
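On Linux/macOS, the variables from step 3 can also be exported in the shell before launching PyCharm. A minimal sketch, assuming the archive was unzipped into ~/Downloads (adjust the path to your machine):

```shell
# Point Spark-related variables at the unzipped distribution
# (the download location below is an assumption, not a requirement)
export SPARK_HOME="$HOME/Downloads/spark-2.4.4-bin-hadoop2.7"
export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python"
echo "$PYTHONPATH"
```

The py4j zip version (0.10.7) must match the one shipped inside your Spark distribution's python/lib directory.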

Now you should be able to run pyspark code from PyCharm. Try the code below:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession with the avro jar on the classpath
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars', '<path>/spark-avro_2.11-2.4.4.jar') \
    .getOrCreate()

df = spark.read.format('avro').load('<path>/userdata1.avro')

df.show()
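A wrong `spark.jars` path only surfaces as an error at read time, so a quick existence check before building the session can save a confusing stack trace. A small sketch; the helper name and placeholder path are illustrative, not part of any Spark API:

```python
import os

def jar_is_usable(path: str) -> bool:
    # A local jar passed via spark.jars must exist and be a .jar file;
    # Spark does not raise a clear error for a bad path until load time.
    return path.endswith(".jar") and os.path.isfile(path)

# Placeholder path as in the snippet above; replace with your real location
print(jar_is_usable("<path>/spark-avro_2.11-2.4.4.jar"))  # False until the path is real
```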

The above code will work, but PyCharm will complain about pyspark modules. To remove that and enable the code completion feature, follow these additional steps:

  1. In Project Structure, click on Add Content Root and add spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip

Now your project structure should look like:

[screenshot of the project structure]

Output: [screenshot of the DataFrame output]

wypul
  • 807
  • 6
  • 9
  • Hi, I've done this step by step but now it does not see pyspark, `ModuleNotFoundError: No module named 'pyspark'` appears even after your additional step. – cincin21 Dec 10 '19 at 10:23
  • @cincin21 are you able to run the code at least? If you haven't tried, can you try running the code and let me know. If you are able to run the code and getting the error only in PyCharm identifying pyspark, you should add `/spark-2.4.4-bin-hadoop2.7/python` to the content root and then try if it works. – wypul Dec 10 '19 at 10:28
  • added 2nd path, seems ok now! Great work, could you only change "2.4.3" to "2.4.4" in some places of your walkthrough to keep it for the future users? best! – cincin21 Dec 10 '19 at 10:35

The following works for me with Spark 3.0.2:

pyspark --jars /<path_to>/spark-avro_<version>.jar


A simple solution is to submit the module from the Terminal tab inside PyCharm with the spark-submit command, as below.

General syntax of the command:

spark-submit --packages <package_name> <script_path>

As avro is the package needed, com.databricks:spark-avro_2.11:4.0.0 should be included. So the final command will be:

spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 <script_path>

j raj
  • 1
    One main thing I missed. While reading the dataframe, format should be ```com.databricks.spark.avro``` not simple ```avro``` like this ```df=spark.read.format('com.databricks.spark.avro').load('/user.avro')``` – j raj Dec 10 '19 at 12:03
  • @j raj -> I've tried this option and strange java error appeared: `py4j.protocol.Py4JJavaError: An error occurred while calling o34.load`. But maybe it will work on another machine ... – cincin21 Dec 10 '19 at 12:40
  • May i know the error? Usually, the error reason comes right after the statement you mentioned in the comment. – j raj Dec 10 '19 at 12:47
  • 1
    It says: `py4j.protocol.Py4JJavaError: An error occurred while calling o34.load. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html` – cincin21 Dec 10 '19 at 13:18
  • Error says, the program is trying to read a file of ```org.apache.spark.sql.avro.AvroFileFormat``` format which is incorrect. Could you specify ```com.databricks.spark.avro``` as format and execute. It should work. – j raj Dec 11 '19 at 08:53
  • I'm using `df = spark.read.format('com.databricks.spark.avro').load(path)` , is this what you are looking for? – cincin21 Dec 11 '19 at 10:12
  • Yes.It should work. If it is not, i am not sure how to fix it. Sorry. :( – j raj Dec 11 '19 at 11:51
  • Sure, thank you for help, the above example works so I'm happy ;) – cincin21 Dec 11 '19 at 12:39
  • Avro is included in latest Spark from a separate package. Don't use the databricks one anymore – OneCricketeer Dec 30 '19 at 06:29

Your Spark version and your avro JAR version should be in sync.
For example, if you're using Spark 3.1.2, your avro jar should be spark-avro_2.12-3.1.2.jar.
Sample Code:

spark = SparkSession.builder.appName('DataFrame').\
        config('spark.jars','C:\\Users\\<<User_Name>>\\Downloads\\spark-avro_2.12-3.1.2.jar').getOrCreate()
df = spark.read.format('avro').load('C:\\Users\\<<user name>>\\Downloads\\sample.avro')
df.show()

Output:
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
|           datetime|country|region|publisher_id|placement_id|       impression_id|consent|   hostname|                uuid|placement_type_id|iab_device_type_id|site_id|request_type|placement_type|bid_url_domain|app_bundle|                 tps|
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
|2021-07-30 14:55:18|   null|  null|        5016|        5016|8bdf2cf1-3a17-473...|      4|test.server|9515d578-9ee0-462...|                0|                 5|   5016|      advast|         video|          null|      null|{5016 -> {5016, n...|
|2021-07-30 14:55:22|   null|  null|        2702|        2702|ab3b63d1-a916-4d7...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   2702|         adi|        banner|          null|      null|{2702 -> {2702, n...|
|2021-07-30 14:55:24|   null|  null|        1106|        1106|574f078f-0fc6-452...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   1106|         adi|        banner|          null|      null|{1106 -> {1106, n...|
|2021-07-30 14:55:25|   null|  null|        1107|        1107|54bf5cf8-3438-400...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   1107|         adi|        banner|          null|      null|{1107 -> {1107, n...|
|2021-07-30 14:55:27|   null|  null|        4277|        4277|b3508668-3ad5-4db...|      4|test.server|9515d578-9ee0-462...|                1|                 2|   4277|         adi|        banner|          null|      null|{4277 -> {4277, n...|
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
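The jar's file name encodes both the Scala build and the Spark version, so the matching rule above can be written down explicitly. A small sketch; the helper function is mine, not part of any Spark API, and the Scala 2.12 default assumes a Spark 3.x build:

```python
def avro_jar_name(spark_version: str, scala_version: str = "2.12") -> str:
    # spark-avro artifacts are named spark-avro_<scala>-<spark>.jar;
    # pick the scala_version that matches your Spark distribution's build
    return f"spark-avro_{scala_version}-{spark_version}.jar"

print(avro_jar_name("3.1.2"))  # spark-avro_2.12-3.1.2.jar
```

For the older Spark 2.4.x line built against Scala 2.11, `avro_jar_name("2.4.4", "2.11")` yields the spark-avro_2.11-2.4.4.jar used in the accepted answer.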