Databricks: Issue while creating spark data frame from pandas

Question

I have a pandas DataFrame that I want to convert into a Spark DataFrame. I usually use the code below, but it suddenly started failing with the error shown. I am aware that pandas has removed iteritems(). My current pandas version is 2.0.0; I also tried installing an older version and creating the Spark DataFrame again, but I still get the same error. The error is raised inside the Spark function. What is the solution? Which pandas version should I install in order to create the Spark DataFrame? I also tried changing the Databricks cluster runtime and re-running, but I still get the same error.

import pandas as pd
spark.createDataFrame(pd.DataFrame({'i':[1,2,3],'j':[1,2,3]}))

Error:
UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  'DataFrame' object has no attribute 'iteritems'
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
AttributeError: 'DataFrame' object has no attribute 'iteritems'

score 21 · Accepted Answer · answered Apr 04 '23 at 08:09

This is related to the Databricks Runtime (DBR) version in use: the Spark versions shipped with DBR up to 12.2 rely on the .iteritems function to construct a Spark DataFrame from a pandas DataFrame. The issue was fixed in Spark 3.4, which is available in DBR 13.x.

If you can't upgrade to DBR 13.x, you need to downgrade pandas to the latest 1.x version (1.5.3 right now) by running %pip install -U pandas==1.5.3 in your notebook. That said, it's better to use the pandas version shipped with your DBR, since it was tested for compatibility with the other packages in DBR.
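To make the version rule above concrete, here is a minimal sketch that decides from two version strings whether a setup is affected. The helper names `version_tuple` and `needs_iteritems_workaround` are hypothetical, and the thresholds (pandas 2.0 and Spark 3.4) are taken from this answer, not from any official compatibility matrix.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading (major, minor) numbers of a version string."""
    parts = re.findall(r"\d+", version)
    return tuple(int(p) for p in parts[:2])


def needs_iteritems_workaround(pandas_version: str, spark_version: str) -> bool:
    """Hypothetical check: True when pandas no longer has iteritems()
    (2.0 and later) while Spark still calls it (before 3.4)."""
    pandas_removed_iteritems = version_tuple(pandas_version) >= (2, 0)
    spark_still_calls_it = version_tuple(spark_version) < (3, 4)
    return pandas_removed_iteritems and spark_still_calls_it


print(needs_iteritems_workaround("2.0.0", "3.3.2"))  # True: affected combination
print(needs_iteritems_workaround("1.5.3", "3.3.2"))  # False: pandas 1.x still has iteritems
print(needs_iteritems_workaround("2.0.0", "3.4.0"))  # False: fixed on the Spark side
```

In a notebook you would presumably pass `pd.__version__` and `spark.version` instead of the literal strings.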

This is the ideal answer that I was looking for. Thanks. – Ak777, Apr 19 '23 at 00:33

Gordon · Answer 2 · 2023-07-07T04:27:33.170

8

I couldn't change package versions, but it looks like this was a name change only.

So I did

df.iteritems = df.items

and spark.createDataFrame(df) works now.

Sure, it's ugly, and it will break my notebook when I move to a cluster with a new DBR, but it works for now.
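A slightly more defensive variant of the same idea is to add the alias only when it is missing, so the cell does nothing on an environment where `iteritems` still exists. This is only a sketch: `ensure_iteritems` is a hypothetical helper, and `FakeFrame` is a stand-in class used so the example runs without pandas or Spark installed.

```python
def ensure_iteritems(frame_cls) -> bool:
    """Alias `iteritems` to `items` on frame_cls if the alias is missing.

    Returns True if the alias was added, False if it already existed.
    """
    if not hasattr(frame_cls, "iteritems"):
        frame_cls.iteritems = frame_cls.items
        return True
    return False


class FakeFrame:
    """Stand-in for a pandas 2.x DataFrame: has items() but no iteritems()."""

    def __init__(self, data):
        self._data = data

    def items(self):
        return iter(self._data.items())


patched = ensure_iteritems(FakeFrame)
frame = FakeFrame({"i": [1, 2, 3], "j": [1, 2, 3]})
columns = [name for name, _ in frame.iteritems()]
print(patched, columns)
```

With pandas, the call would be `ensure_iteritems(pd.DataFrame)` before `spark.createDataFrame(df)`. Patching the class rather than one instance means every DataFrame in the session picks up the alias.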

edited Jul 07 '23 at 04:27

answered May 31 '23 at 18:21


score 2 · Answer 3 · answered Apr 04 '23 at 07:46

The Arrow optimization is failing because of the missing 'iteritems' attribute. Try disabling the Arrow optimization in your Spark session and creating the DataFrame without it.

Here is how it would work:

import pandas as pd
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Pandas to Spark DataFrame") \
    .getOrCreate()

# Disable Arrow optimization
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

# Create a pandas DataFrame
pdf = pd.DataFrame({'i': [1, 2, 3], 'j': [1, 2, 3]})

# Convert pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Show the Spark DataFrame
sdf.show()

It should work, but if you want to keep the Arrow optimization you can instead downgrade your pandas version, for example with pip install pandas==1.2.5

Have you cleared the cache before running your code again? Also try my code. – Saxtheowl, Apr 04 '23 at 08:08

Ranga Reddy · Answer 4 · 2023-05-23T05:15:01.883

2

This issue occurs with pandas version >= 2.0. In pandas 2.0, the .iteritems function was removed.

There are two solutions for this issue.

1. Downgrade pandas to a version below 2. For example:

pip install -U pandas==1.5.3

2. Use the latest Spark version, i.e. 3.4.

edited May 23 '23 at 05:15

answered Apr 12 '23 at 11:41


score 0 · Answer 5 · answered Aug 21 '23 at 07:25


If you want to keep your current pandas version, try this:

import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items

answered Aug 21 '23 at 07:25

ayoub hamaoui

