16

I am running PySpark 3.0.1 for Hadoop 2.7 in a Zeppelin notebook. In general all is well; however, when I execute df.explain() on a DataFrame I get this error:

Fail to execute line 3: df.explain()
Traceback (most recent call last):
  File "/tmp/1610595392738-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 3, in <module>
  File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 356, in explain
    print(self._sc._jvm.PythonSQLUtils.explainString(self._jdf.queryExecution(), explain_mode))
TypeError: 'JavaPackage' object is not callable

Has anyone come across and resolved this error before in the context of explain()?

My spark/jars folder contents:

activation-1.1.1.jar
aircompressor-0.10.jar
algebra_2.12-2.0.0-M2.jar
alluxio-2.4.1-client.jar
antlr4-runtime-4.7.1.jar
antlr-runtime-3.5.2.jar
aopalliance-1.0.jar
aopalliance-repackaged-2.6.1.jar
apacheds-i18n-2.0.0-M15.jar
apacheds-kerberos-codec-2.0.0-M15.jar
api-asn1-api-1.0.0-M20.jar
api-util-1.0.0-M20.jar
arpack_combined_all-0.1.jar
arrow-format-0.15.1.jar
arrow-memory-0.15.1.jar
arrow-vector-0.15.1.jar
audience-annotations-0.5.0.jar
automaton-1.11-8.jar
avro-1.8.2.jar
avro-ipc-1.8.2.jar
avro-mapred-1.8.2-hadoop2.jar
bonecp-0.8.0.RELEASE.jar
breeze_2.12-1.0.jar
breeze-macros_2.12-1.0.jar
cats-kernel_2.12-2.0.0-M4.jar
chill_2.12-0.9.5.jar
chill-java-0.9.5.jar
commons-beanutils-1.9.4.jar
commons-cli-1.2.jar
commons-codec-1.10.jar
commons-collections-3.2.2.jar
commons-compiler-3.0.16.jar
commons-compress-1.8.1.jar
commons-configuration-1.6.jar
commons-crypto-1.0.0.jar
commons-dbcp-1.4.jar
commons-digester-1.8.jar
commons-httpclient-3.1.jar
commons-io-2.4.jar
commons-lang-2.6.jar
commons-lang3-3.9.jar
commons-logging-1.1.3.jar
commons-math3-3.4.1.jar
commons-net-3.1.jar
commons-pool-1.5.4.jar
commons-text-1.6.jar
compress-lzf-1.0.3.jar
core-1.1.2.jar
curator-client-2.7.1.jar
curator-framework-2.7.1.jar
curator-recipes-2.7.1.jar
datanucleus-api-jdo-4.2.4.jar
datanucleus-core-4.1.17.jar
datanucleus-rdbms-4.1.19.jar
derby-10.12.1.1.jar
dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar
flatbuffers-java-1.9.0.jar
generex-1.0.2.jar
gson-2.2.4.jar
guava-14.0.1.jar
guice-3.0.jar
guice-servlet-3.0.jar
hadoop-annotations-2.7.4.jar
hadoop-auth-2.7.4.jar
hadoop-client-2.7.4.jar
hadoop-common-2.7.4.jar
hadoop-hdfs-2.7.4.jar
hadoop-mapreduce-client-app-2.7.4.jar
hadoop-mapreduce-client-common-2.7.4.jar
hadoop-mapreduce-client-core-2.7.4.jar
hadoop-mapreduce-client-jobclient-2.7.4.jar
hadoop-mapreduce-client-shuffle-2.7.4.jar
hadoop-yarn-api-2.7.4.jar
hadoop-yarn-client-2.7.4.jar
hadoop-yarn-common-2.7.4.jar
hadoop-yarn-server-common-2.7.4.jar
hadoop-yarn-server-web-proxy-2.7.4.jar
HikariCP-2.5.1.jar
hive-beeline-2.3.7.jar
hive-cli-2.3.7.jar
hive-common-2.3.7.jar
hive-exec-2.3.7-core.jar
hive-jdbc-2.3.7.jar
hive-llap-common-2.3.7.jar
hive-metastore-2.3.7.jar
hive-serde-1.2.1.spark2.jar
hive-serde-2.3.7.jar
hive-shims-0.23-2.3.7.jar
hive-shims-1.2.1.spark2.jar
hive-shims-2.3.7.jar
hive-shims-common-2.3.7.jar
hive-shims-scheduler-2.3.7.jar
hive-storage-api-2.7.1.jar
hive-vector-code-gen-2.3.7.jar
hk2-api-2.6.1.jar
hk2-locator-2.6.1.jar
hk2-utils-2.6.1.jar
htrace-core-3.1.0-incubating.jar
httpclient-4.5.6.jar
httpcore-4.4.12.jar
istack-commons-runtime-3.0.8.jar
ivy-2.4.0.jar
jackson-annotations-2.10.0.jar
jackson-core-2.10.0.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.10.0.jar
jackson-dataformat-yaml-2.10.0.jar
jackson-datatype-jsr310-2.10.3.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-jaxb-annotations-2.10.0.jar
jackson-module-paranamer-2.10.0.jar
jackson-module-scala_2.12-2.10.0.jar
jackson-xc-1.9.13.jar
jakarta.activation-api-1.2.1.jar
jakarta.annotation-api-1.3.5.jar
jakarta.inject-2.6.1.jar
jakarta.validation-api-2.0.2.jar
jakarta.ws.rs-api-2.1.6.jar
jakarta.xml.bind-api-2.3.2.jar
janino-3.0.16.jar
javassist-3.25.0-GA.jar
javax.inject-1.jar
javax.jdo-3.2.0-m3.jar
javax.servlet-api-3.1.0.jar
javolution-5.5.1.jar
jaxb-api-2.2.2.jar
jaxb-runtime-2.3.2.jar
jcl-over-slf4j-1.7.30.jar
jdo-api-3.0.1.jar
jersey-client-2.30.jar
jersey-common-2.30.jar
jersey-container-servlet-2.30.jar
jersey-container-servlet-core-2.30.jar
jersey-hk2-2.30.jar
jersey-media-jaxb-2.30.jar
jersey-server-2.30.jar
jetty-6.1.26.jar
jetty-sslengine-6.1.26.jar
jetty-util-6.1.26.jar
JLargeArrays-1.5.jar
jline-2.14.6.jar
joda-time-2.10.5.jar
jodd-core-3.5.2.jar
jpam-1.1.jar
json-1.8.jar
json4s-ast_2.12-3.6.6.jar
json4s-core_2.12-3.6.6.jar
json4s-jackson_2.12-3.6.6.jar
json4s-scalap_2.12-3.6.6.jar
jsp-api-2.1.jar
jsr305-3.0.0.jar
jta-1.1.jar
JTransforms-3.1.jar
jul-to-slf4j-1.7.30.jar
kryo-shaded-4.0.2.jar
kubernetes-client-4.9.2.jar
kubernetes-model-4.9.2.jar
kubernetes-model-common-4.9.2.jar
leveldbjni-all-1.8.jar
libfb303-0.9.3.jar
libthrift-0.12.0.jar
log4j-1.2.17.jar
logging-interceptor-3.12.6.jar
lz4-java-1.7.1.jar
machinist_2.12-0.6.8.jar
macro-compat_2.12-1.1.1.jar
mesos-1.4.0-shaded-protobuf.jar
metrics-core-4.1.1.jar
metrics-graphite-4.1.1.jar
metrics-jmx-4.1.1.jar
metrics-json-4.1.1.jar
metrics-jvm-4.1.1.jar
minlog-1.3.0.jar
netty-all-4.1.47.Final.jar
objenesis-2.5.1.jar
okhttp-3.12.6.jar
okio-1.15.0.jar
opencsv-2.3.jar
orc-core-1.5.10.jar
orc-mapreduce-1.5.10.jar
orc-shims-1.5.10.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.3.jar
paranamer-2.8.jar
parquet-column-1.10.1.jar
parquet-common-1.10.1.jar
parquet-encoding-1.10.1.jar
parquet-format-2.4.0.jar
parquet-hadoop-1.10.1.jar
parquet-jackson-1.10.1.jar
postgresql-42.2.14.jar
protobuf-java-2.5.0.jar
py4j-0.10.9.jar
pyrolite-4.30.jar
RoaringBitmap-0.7.45.jar
scala-collection-compat_2.12-2.1.1.jar
scala-compiler-2.12.10.jar
scala-library-2.12.10.jar
scala-parser-combinators_2.12-1.1.2.jar
scala-reflect-2.12.10.jar
scala-xml_2.12-1.2.0.jar
shapeless_2.12-2.3.3.jar
shims-0.7.45.jar
slf4j-api-1.7.30.jar
slf4j-log4j12-1.7.30.jar
snakeyaml-1.24.jar
snappy-java-1.1.7.5.jar
spark-catalyst_2.12-3.0.1.jar
spark-core_2.12-3.0.1.jar
spark-graphx_2.12-3.0.1.jar
spark-hive_2.12-3.0.1.jar
spark-hive-thriftserver_2.12-3.0.1.jar
spark-kubernetes_2.12-3.0.1.jar
spark-kvstore_2.12-3.0.1.jar
spark-launcher_2.12-3.0.1.jar
spark-mesos_2.12-3.0.1.jar
spark-mllib_2.12-3.0.1.jar
spark-mllib-local_2.12-3.0.1.jar
spark-network-common_2.12-3.0.1.jar
spark-network-shuffle_2.12-3.0.1.jar
spark-repl_2.12-3.0.1.jar
spark-sketch_2.12-3.0.1.jar
spark-sql_2.12-3.0.1.jar
spark-streaming_2.12-3.0.1.jar
spark-tags_2.12-3.0.1.jar
spark-tags_2.12-3.0.1-tests.jar
spark-unsafe_2.12-3.0.1.jar
spark-yarn_2.12-3.0.1.jar
spire_2.12-0.17.0-M1.jar
spire-macros_2.12-0.17.0-M1.jar
spire-platform_2.12-0.17.0-M1.jar
spire-util_2.12-0.17.0-M1.jar
ST4-4.0.4.jar
stax-api-1.0.1.jar
stax-api-1.0-2.jar
stream-2.9.6.jar
super-csv-2.2.0.jar
threeten-extra-1.5.0.jar
transaction-api-1.1.jar
univocity-parsers-2.9.0.jar
velocity-1.5.jar
xbean-asm7-shaded-4.15.jar
xercesImpl-2.12.0.jar
xml-apis-1.4.01.jar
xmlenc-0.52.jar
xz-1.5.jar
zjsonpatch-0.3.0.jar
zookeeper-3.4.14.jar
zstd-jni-1.4.4-3.jar

I gather the error is saying something might not be on my classpath, but I can't think what that might be.

Phil

3 Answers

19

I ran into this same issue on AWS with EMR 6.2.0 (also Spark 3.0.1, coincidentally) and Jupyter notebooks. The issue appears to be related to how PySpark is initialized, specifically the py4j Java imports.

The following import is supposed to be executed while the notebook kernel is being initialized but seems to be skipped. You just need to run this once per session.

from py4j.java_gateway import java_import
java_import(spark._sc._jvm, "org.apache.spark.sql.api.python.*")

Now df.explain() works as expected.

For future reference - when you see 'JavaPackage' object is not callable, it often means that the target Java class was not found. Either the class doesn't exist or the expected import hasn't been called.
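A minimal sketch of how that diagnosis could be checked from the notebook, assuming `spark` is an active SparkSession. The helper names `is_unresolved_jvm_name` and `ensure_python_sql_utils` are hypothetical and not part of PySpark or py4j. py4j returns a `JavaPackage` placeholder when a name does not resolve to a class, so inspecting the type of the returned object shows whether the import is missing:

```python
def is_unresolved_jvm_name(obj):
    """Return True if py4j handed back a package placeholder instead of a class."""
    # Compare by type name so this helper does not itself need py4j imported.
    return type(obj).__name__ == "JavaPackage"


def ensure_python_sql_utils(spark):
    """Re-run the import above only when PythonSQLUtils is unresolved.

    `spark` is assumed to be an active SparkSession (hypothetical helper).
    """
    # Imported lazily so the helper above stays usable without py4j installed.
    from py4j.java_gateway import java_import

    jvm = spark._sc._jvm
    if is_unresolved_jvm_name(jvm.PythonSQLUtils):
        java_import(jvm, "org.apache.spark.sql.api.python.*")
    # True means the class now resolves and df.explain() should be callable.
    return not is_unresolved_jvm_name(jvm.PythonSQLUtils)
```

Calling `ensure_python_sql_utils(spark)` once at the top of a session would apply the import only when it is actually needed.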

Mike Park
  • 1
    Amazing. This fixed the issue as described above for me. I have EMR 6.5 and Spark 3.1.2. Is this a bug in EMR? – mherzog Apr 20 '22 at 02:24
  • 2
    This is a bug in Livy, which EMR uses under the hood. Livy seems to have been dead for two years. With Scala 2.12.10 and Spark 3.0.3 everything works, but using newer Spark versions produces this error. I hacked together a Livy using Scala 2.12.15 to be able to use Spark 3.3 and I get the same problem; using this import fixes it. – rabejens Nov 03 '22 at 16:11
  • I'm trying to implement this in `emr-containers` to no avail (yet). It's a WIP between here and this source: https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/ – nate Nov 22 '22 at 16:08
2

Mac Users / Linux Users

$ nano ~/.bash_profile
or
$ nano ~/.zshrc

Add these environment variables:

export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*]"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
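# Expand the glob explicitly: a wildcard inside an assignment is not expanded by the shell
export PYTHONPATH=$(echo $SPARK_HOME/python/lib/*.zip | tr ' ' ':'):$PYTHONPATH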
export PYSPARK_DRIVER_PYTHON=jupyter  
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Press Ctrl+O, then Enter, to save. Press Ctrl+X to exit.

$ source ~/.bash_profile
or
$ source ~/.zshrc
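As a rough alternative to the two PYTHONPATH exports, the same entries could be added from inside Python. This is only a sketch: the helper names `spark_python_paths` and `add_spark_to_sys_path` are hypothetical, and the `/usr/local/spark` fallback simply mirrors the SPARK_HOME value used above.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and zip archives PySpark needs on sys.path."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    # Expand *.zip explicitly; Python does not expand wildcards in PYTHONPATH.
    zips = sorted(glob.glob(os.path.join(lib_dir, "*.zip")))
    return [os.path.join(spark_home, "python")] + zips


def add_spark_to_sys_path(spark_home=None):
    """Prepend the PySpark paths to sys.path, skipping entries already present."""
    # Fallback path is an assumption taken from the exports above.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/usr/local/spark")
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

Running `add_spark_to_sys_path()` before `import pyspark` would have the same effect as the exports for that one Python process.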
0

PySpark loads the class through the Spark context's JVM view:

sc._jvm.com.package.Class

If the jar containing the corresponding class is not supplied to Zeppelin or Jupyter in its config, this error will be thrown.

Running from spark-submit

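If your PySpark code requires additional jars, add them to spark-submit with the --jars option. The option must come before the script, because anything after the script path is passed to the script as an argument.

e.g.

spark-submit --jars abc.jar,xyz.jar pyspark-job.py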
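When the command is assembled programmatically, the argument order is easy to get wrong: spark-submit options belong before the script, and anything after the script is passed to the script itself. The following is a small sketch of that ordering. `build_spark_submit` is a hypothetical helper, and the jar and script names are the placeholders from the example above.

```python
import shlex


def build_spark_submit(script, jars=(), app_args=()):
    """Return a spark-submit argument list with options placed before the script."""
    cmd = ["spark-submit"]
    if jars:
        # --jars takes a single comma-separated list of jar paths.
        cmd += ["--jars", ",".join(jars)]
    # Everything after the script path is handed to the script itself.
    cmd.append(script)
    cmd.extend(app_args)
    return cmd


command = build_spark_submit("pyspark-job.py", jars=["abc.jar", "xyz.jar"])
print(shlex.join(command))  # spark-submit --jars abc.jar,xyz.jar pyspark-job.py
```

The returned list can be passed directly to `subprocess.run` without going through a shell.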

This answer explains the mechanism by which Java classes are called from PySpark.

Sorter