
I am trying to fix an out-of-memory issue, and I want to know whether I need to change these settings in the default configuration file (spark-defaults.conf) in the Spark home folder, or whether I can set them in the code.

I saw the question "PySpark: java.lang.OutofMemoryError: Java heap space", and it says that the answer depends on whether I'm running in client mode. I'm running Spark on a cluster and monitoring it using standalone.

But how do I figure out whether I'm running Spark in client mode?

makansij
  • By default, Spark runs in `client` mode. If you want to change this, you can set the `--deploy-mode` option to `cluster`. Since the default is `client` mode, unless you have made any changes, I suppose you are running in client mode. – KartikKannapur Jul 15 '16 at 05:01
  • That is useful information about the difference between the two modes, but that doesn't help me know if spark is running in cluster mode or client mode. I am working on a production environment, and I run pyspark in an IPython notebook. I have ssh access to the namenode, and I know where spark home is, but beyond that I don't know where to get the information about whether spark is running in `client` or `cluster` mode. Thank you. – makansij Jul 15 '16 at 16:14

3 Answers


If you are running an interactive shell, e.g. pyspark (CLI or via an IPython notebook), by default you are running in client mode. You can easily verify that you cannot run pyspark or any other interactive shell in cluster mode:

$ pyspark --master yarn --deploy-mode cluster
Python 2.7.11 (default, Mar 22 2016, 01:42:54)
[GCC Intel(R) C++ gcc 4.8 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Error: Cluster deploy mode is not applicable to Spark shells.

$ spark-shell --master yarn --deploy-mode cluster
Error: Cluster deploy mode is not applicable to Spark shells.

Examining the contents of the bin/pyspark file may be instructive, too - here is the final line (which is the actual executable):

$ pwd
/home/ctsats/spark-1.6.1-bin-hadoop2.6
$ cat bin/pyspark
[...]
exec "${SPARK_HOME}"/bin/spark-submit pyspark-shell-main --name "PySparkShell" "$@"

That is, pyspark is actually a script run by spark-submit under the name PySparkShell, which is how you can find it in the Spark History Server UI. Because it is launched this way, it uses whatever arguments (or defaults) are passed to its spark-submit command.
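As a rough illustration of "arguments or defaults", the sketch below scans a spark-submit style argument string for the `--deploy-mode` option used in the commands above and falls back to client when it is absent. The helper name `deploy_mode_from_args` is hypothetical, and this is not how spark-submit itself parses its arguments; it only handles the space-separated form shown above.

```python
import shlex


def deploy_mode_from_args(submit_args, default="client"):
    """Return the value following --deploy-mode, or `default` if the option is absent.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    tokens = shlex.split(submit_args)
    # Stop one token early so that tokens[i + 1] always exists.
    for i, token in enumerate(tokens[:-1]):
        if token == "--deploy-mode":
            return tokens[i + 1]
    return default


print(deploy_mode_from_args("--master yarn --deploy-mode cluster"))  # cluster
print(deploy_mode_from_args("--master yarn"))                        # client
```

The same check could be pointed at whatever argument string your own launcher or notebook setup passes along, assuming you know where that string is defined.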

desertnaut
  • OP asked about how to know the deploy mode of a *running* spark (application). – Jacek Laskowski Jul 20 '16 at 10:25
  • And you consider this a reason for downvoting? In the end, my answer does address the question, which is how to *know*... The question is also clearly about the pyspark API, not Scala - nevertheless, I upvoted your answer because I learned something (and this is my main criterion for upvoting...). – desertnaut Jul 20 '16 at 10:51
  • Thanks @desertnaut. I appreciate the upvoting. My downvoting was to mark your answer as slightly offbase -- you didn't really answer the question (I may've not either but left the OP with a home work :)) – Jacek Laskowski Jul 20 '16 at 10:54
  • Thanks for the reply. I would argue though that "slightly offbase" answers do not deserve downvotes... – desertnaut Jul 20 '16 at 10:59

Since sc.deployMode is not available in PySpark, you can check the spark.submit.deployMode configuration property instead.

>>> sc.getConf().get("spark.submit.deployMode")
'client'
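A minimal sketch of wrapping that lookup in a small function is shown below. The helper name `get_deploy_mode` is hypothetical, and the fallback to client is an assumption based on the comment above that client is the default. It accepts any object with a `get(key, default)` method, so a plain dict stands in for `sc.getConf()` here and the example runs without a Spark installation.

```python
def get_deploy_mode(conf, default="client"):
    """Read spark.submit.deployMode from a SparkConf-like object.

    `conf` only needs a get(key, default) method; the "client" fallback
    is an assumption, not something read from Spark itself.
    """
    return conf.get("spark.submit.deployMode", default)


# In a live PySpark session one would pass sc.getConf() instead.
# The dict below is a stand-in so the sketch runs on its own.
stand_in_conf = {"spark.submit.deployMode": "client"}
print(get_deploy_mode(stand_in_conf))  # client
```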

In Scala, use sc.deployMode (this is not available in PySpark):

scala> sc.deployMode
res0: String = client

scala> sc.version
res1: String = 2.1.0-SNAPSHOT
Jacek Laskowski
  • For me, `sc.getConf().get("spark.submit.deployMode")` worked - notice the extra parentheses after `getConf`, because it is a function. I am on pyspark==3.2.1 – akki Jul 06 '22 at 23:10
  • Correct. For PySpark users, the round brackets are a must (unlike Scala). – Jacek Laskowski Jul 18 '22 at 13:56

As of Spark 2.0+, the following works:

# Print every (key, value) pair in the active Spark configuration.
for item in spark.sparkContext.getConf().getAll():
    print(item)

(u'spark.submit.deployMode', u'client') # will be one of the items in the list.
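Rather than scanning the printed list by eye, the (key, value) pairs can be turned into a dict and queried directly. The sketch below assumes the pairs have the shape shown above; the helper name `find_setting` and the sample values are made up for illustration, and a hard-coded list stands in for the result of `getAll()` so the example runs without Spark.

```python
def find_setting(pairs, key, default=None):
    """Look up one key in a list of (key, value) configuration pairs."""
    return dict(pairs).get(key, default)


# Stand-in for spark.sparkContext.getConf().getAll(); values are illustrative.
sample_pairs = [
    ("spark.app.name", "PySparkShell"),
    ("spark.submit.deployMode", "client"),
]
print(find_setting(sample_pairs, "spark.submit.deployMode"))  # client
```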
desertnaut