
What is the purpose of the getOrCreate method from the SparkContext class? I don't understand when we should use this method.

If I have 2 Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context?

Or is the purpose simpler: when I create a Spark app and I don't want to send the Spark context as a parameter to a method, I can get it as a singleton object?

Cosmin

1 Answer


If I have 2 Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context?

No, SparkContext is a local object. It is not shared between applications.

when I create a Spark app and I don't want to send the Spark context as a parameter to a method, I can get it as a singleton object?

This is exactly the reason. SparkContext (or SparkSession) is ubiquitous in Spark applications and in core Spark's source, and passing it around would be a huge burden.

It is also useful for multithreaded applications, where an arbitrary thread can initialize the context.
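A minimal sketch of that pattern, assuming Spark's Scala API (the object name, helper method, and file path are illustrative): the helper fetches the already-registered context itself instead of receiving it as a parameter, and another thread sees the same singleton.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GetOrCreateExample {

  // Helper takes no SparkContext parameter; it fetches the
  // already-registered singleton (or would create one on the first call).
  def lineCount(path: String): Long =
    SparkContext.getOrCreate().textFile(path).count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("get-or-create-example").setMaster("local[*]")
    val sc = SparkContext.getOrCreate(conf) // first call creates the context

    // Works the same from another thread - it sees the same singleton.
    val worker = new Thread(() => println(lineCount("data/sample.txt")))
    worker.start()
    worker.join()

    sc.stop()
  }
}
```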

Regarding the docs:

This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.

The driver runs in its own JVM, and there is no built-in mechanism for sharing it between multiple full-fledged Java applications (proper applications executing their own main; check Is there one JVM per Java application? and Why have one JVM per application? for related general questions). "Application" here refers to a "logical application" where multiple modules execute their own code - one example is a SparkJob on spark-jobserver. This scenario is no different from passing SparkContext to a function.
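To illustrate that last point, a sketch (module names are hypothetical) of what "sharing" means within a single driver JVM: every call to `SparkContext.getOrCreate` from any module returns the same registered instance, which is equivalent to passing `sc` to a function.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Two "logical applications" (modules) living in the same driver JVM.
object ModuleA { def context: SparkContext = SparkContext.getOrCreate() }
object ModuleB { def context: SparkContext = SparkContext.getOrCreate() }

object SharedDriver {
  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate(
      new SparkConf().setAppName("shared-driver").setMaster("local[*]"))

    // Both modules get back the very same instance - "sharing" in the sense
    // the docs describe, equivalent to passing sc around as a parameter.
    println(ModuleA.context eq sc) // true
    println(ModuleB.context eq sc) // true

    sc.stop()
  }
}
```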

Alper t. Turker
  • Your first point is incorrect according to the docs: [This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.](https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/SparkContext.html#getOrCreate()) – Jeremy Dec 14 '17 at 18:59
  • @Jeremy It is not incorrect. In the normal case you can have only one application per JVM (Spark application or not) - [Is there one JVM per Java application?](https://stackoverflow.com/q/5947207/8371915) and [Why have one JVM per application?](https://stackoverflow.com/q/13539132/8371915). – Alper t. Turker Dec 14 '17 at 20:18
  • I was responding to your statement, "It is not shared between applications". The SparkContext, using getOrCreate(), is shared. It creates the SparkContext if one does not already exist or, if one already exists, it gets an instance of it. Here is the relevant description from the [Apache Spark source](https://github.com/apache/spark/blob/6d99940397136e4ed0764f83f442bcffcb20d6e7/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L880) – Jeremy Dec 14 '17 at 20:55
  • 1
    @Jeremy And you cannot share the context in applications (in strict sense) cannot share a JVM. What you refer to is sharing context inside application. – Alper t. Turker Dec 14 '17 at 22:56
  • I have the same problem that you two are talking about. I think the word "application" from the documentation is used incorrectly: if the Spark context were shared between many applications, then if one of those applications decided to close itself, all the applications would be closed, because they have the same SparkContext. On the other hand, the log that Spark writes when instantiating the same context ("Using an existing SparkContext some configuration may not take effect") is very confusing, because it doesn't make sense if I use that method just to avoid sending the object as a parameter. – Cosmin Dec 15 '17 at 09:19
  • @Cosmin There are actually platforms which share a `SparkContext` ([jobserver](https://github.com/spark-jobserver/spark-jobserver) is the best known one), but in that case you don't submit a full-featured application (with `main` and such). `SparkContext` can be initialized once, and its core configuration (`sql` or Hadoop config is a different thing) cannot be modified once initialized - this is why you get the warning (see the sketch after these comments). – Alper t. Turker Dec 15 '17 at 14:30
  • Ok, so jobserver is a shared-SparkContext platform because the jobs run in the same JVM - usually on the same Spark context, but on different threads, from what I know. Is that the reason we call jobserver a shared-SparkContext platform? – Cosmin Dec 15 '17 at 15:44
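Regarding the warning discussed in the comments above, a small sketch (app names are illustrative) of why it appears: a second `getOrCreate` with a different `SparkConf` returns the existing context, and the new configuration is not applied.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ExistingContextWarning {
  def main(args: Array[String]): Unit = {
    val first = SparkContext.getOrCreate(
      new SparkConf().setAppName("first-app").setMaster("local[2]"))

    // The context already exists, so this second conf is ignored and Spark
    // logs a warning along the lines of "Using an existing SparkContext,
    // some configuration may not take effect".
    val second = SparkContext.getOrCreate(
      new SparkConf().setAppName("second-app").setMaster("local[4]"))

    println(first eq second) // true - same singleton instance
    println(second.appName)  // still "first-app"
    println(second.master)   // still "local[2]"

    first.stop()
  }
}
```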