
When I run the following code as a Python script (executing it with python directly), I get the error below. When I start a pyspark session and then import koalas, create the DataFrame, and call head(), it runs fine and gives me the expected output.

Is there a specific way the SparkSession needs to be set up for koalas to work?

from pyspark.sql import SparkSession
import pandas as pd
import databricks.koalas as ks


# Create (or reuse) a local SparkSession
spark = SparkSession.builder \
        .master("local[*]") \
        .appName("Pycedro Spark Application") \
        .getOrCreate()


# Build a small koalas DataFrame from plain Python lists
kdf = ks.DataFrame({"a": [4, 5, 6],
                    "b": [7, 8, 9],
                    "c": [10, 11, 12]})

print(kdf.head())

Error when running it as a Python script:

    File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from '/usr/local/Cellar/apache-spark/3.1.1/libexec/python/lib/pyspark.zip/pyspark/cloudpickle/__init__.py'>

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
[...]

Versions: koalas 1.7.0, pyspark 3.0.2


1 Answer


I had a similar problem with PySpark. Upgrading PySpark from version 3.0.2 to 3.1.2 solved the problem. Here is my setup:

  • Hadoop version: 3.2.2
  • Spark version: 3.1.2
  • Python version: 3.8.5

Interestingly,

df = spark.read.csv("hdfs:///data.csv")
df.show(2)

worked well, but

rdd = sc.textFile("hdfs:///data.csv")
rdd.take(2)

resulted in the following error, presumably because take() on an RDD ships a pickled Python command to the Python workers, while the DataFrame CSV reader is executed entirely on the JVM:

AttributeError: Can't get attribute '_fill_function' on <module 'pyspark.cloudpickle' from '/opt/spark/python/lib/pyspark.zip/pyspark/cloudpickle/__init__.py'>

Upgrading PySpark solved the problem; the idea to upgrade came from the following issue: https://issues.apache.org/jira/browse/SPARK-29536
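A quick way to confirm this kind of mismatch before upgrading is to compare the pip-installed pyspark library with the Spark runtime it actually talks to. A minimal sketch, assuming a local session (note that in the question above, the traceback paths point at a Spark 3.1.1 installation while the pyspark library is 3.0.2):

# Minimal sketch: print both versions. If they disagree, commands pickled
# on the driver can fail to deserialize on the workers, producing errors
# like the '_fill_function' one above.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print("pyspark library:", pyspark.__version__)  # the pip-installed package
print("Spark runtime:  ", spark.version)        # the installation running the job

Once the two matched (after upgrading the library with pip), the error went away for me.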

  • Upgrading PySpark worked for me too – Exorcismus Jul 05 '21 at 12:27