
What I'm trying to achieve is similar to Smart Connector mode, but the documentation isn't helping me much, because its Smart Connector examples are all based on spark-shell, whereas I'm trying to run a standalone Scala application, so I can't pass the spark-shell --conf arguments on the command line.
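
(As I understand it, a --conf key=value flag for spark-shell maps to a .config call on the session builder, so it's really a question of which master URL and properties to set. A minimal sketch of that mapping; the property name, app name, and master here are made up for illustration:)

import org.apache.spark.sql.SparkSession

object ConfMapping {
  def main(args: Array[String]): Unit = {
    // Equivalent of: spark-shell --conf spark.some.property=value
    val spark = SparkSession
      .builder()
      .appName("ConfMapping")                  // hypothetical app name
      .master("local[*]")                      // placeholder master for the sketch
      .config("spark.some.property", "value")  // any --conf key/value goes here
      .getOrCreate()
    spark.stop()
  }
}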

Trying to find my Spark master, I looked at the SnappyData web interface and found the following:

host-data="false"
locators="xxx.xxx.xxx.xxx:10334"
log-file="snappyleader.log"
mcast-port="0"
member-timeout="30000"
persist-dd="false"
route-query="false"
server-groups="IMPLICIT_LEADER_SERVERGROUP"
snappydata.embedded="true"
spark.app.name="SnappyData"
spark.closure.serializer="org.apache.spark.serializer.PooledKryoSerializer"
spark.driver.host="xxx.xxx.xxx.xxx"
spark.driver.port="37838"
spark.executor.id="driver"
spark.local.dir="/var/opt/snappydata/lead1/scratch"
spark.master="snappydata://xxx.xxx.xxx.xxx:10334"
spark.memory.manager="org.apache.spark.memory.SnappyUnifiedMemoryManager"
spark.memory.storageFraction="0.5"
spark.scheduler.mode="FAIR"
spark.serializer="org.apache.spark.serializer.PooledKryoSerializer"
spark.ui.port="5050"
statistic-archive-file="snappyleader.gfs"
--- end --

(The IP addresses are all on one host, for now.)

I have a simple example Spark job, just to test that my cluster is working:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SnappySession

object SnappyTest {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    // Build a regular SparkSession pointed at the SnappyData locator.
    val spark: SparkSession = SparkSession
      .builder()
      .appName("SnappyTest")
      .master("snappydata://xxx.xxx.xxx.xxx:10334")
      .getOrCreate()
    // Wrap the context in a SnappySession to get SnappyData's SQL extensions.
    val snappy = new SnappySession(spark.sparkContext)

    import spark.implicits._

    // Create a trivial Dataset and print the handles, just to confirm setup.
    val caseClassDS = Seq(Person("Andy", 35)).toDS()
    println(caseClassDS)
    println(snappy)
    println(spark)
  }
}

And I got this error:

17/10/25 14:44:57 INFO ServerConnector: Started Spark@ffaaaf0{HTTP/1.1}{0.0.0.0:4040}
17/10/25 14:44:57 INFO Server: Started @2743ms
17/10/25 14:44:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/10/25 14:44:57 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://xxx.xxx.xxx.xxx:4040
17/10/25 14:44:57 INFO SnappyEmbeddedModeClusterManager: setting from url snappydata.store.locators with xxx.xxx.xxx.xxx:10334
17/10/25 14:44:58 INFO LeadImpl: cluster configuration after overriding certain properties 
jobserver.enabled=false
snappydata.embedded=true
snappydata.store.host-data=false
snappydata.store.locators=xxx.xxx.xxx.xxx:10334
snappydata.store.persist-dd=false
snappydata.store.server-groups=IMPLICIT_LEADER_SERVERGROUP
spark.app.name=SnappyTest
spark.driver.host=xxx.xxx.xxx.xxx
spark.driver.port=35602
spark.executor.id=driver
spark.master=snappydata://xxx.xxx.xxx.xxx:10334
17/10/25 14:44:58 INFO LeadImpl: passing store properties as {spark.driver.host=xxx.xxx.xxx.xxx, snappydata.embedded=true, spark.executor.id=driver, persist-dd=false, spark.app.name=SnappyTest, spark.driver.port=35602, spark.master=snappydata://xxx.xxx.xxx.xxx:10334, member-timeout=30000, host-data=false, default-startup-recovery-delay=120000, server-groups=IMPLICIT_LEADER_SERVERGROUP, locators=xxx.xxx.xxx.xxx:10334}
NanoTimer::Problem loading library from URL path: /home/jpride/.ivy2/cache/io.snappydata/gemfire-core/jars/libgemfirexd64.so: java.lang.UnsatisfiedLinkError: no gemfirexd64 in java.library.path
NanoTimer::Problem loading library from URL path: /home/jpride/.ivy2/cache/io.snappydata/gemfire-core/jars/libgemfirexd64.so: java.lang.UnsatisfiedLinkError: no gemfirexd64 in java.library.path
Exception in thread "main" org.apache.spark.SparkException: Primary Lead node (Spark Driver) is already running in the system. You may use smart connector mode to connect to SnappyData cluster.

So how do I (should I?) use smart connector mode in this case?

Joseph Pride
1 Answer


You need to specify the following in your example Spark job:

.master("local[*]")
.config("snappydata.connection", "xxx.xxx.xxx.xxx:1527")
Yogesh Mahajan
  • That ran, thanks! How did you know it was port 1527? Is that the default, and where is it documented? It didn't fully meet my intent, though: rather than running a separate Spark cluster in smart connector mode, I'm trying to connect to the SnappyData cluster itself to run the job, just without using snappy-submit. That way I can debug the app locally but still observe cluster behavior. The error message says "... (Spark Driver) is already running in the system." How do I find that driver and connect to it as master, instead of using "local[*]"? – Joseph Pride Oct 25 '17 at 22:23
  • So you want to run your job within a debug environment (IDE) but still operate in embedded mode? I don't think that is possible today; to use embedded mode you must submit the job to the lead's REST endpoint. – jagsr Oct 25 '17 at 22:57
  • What are you trying to understand? If you run any Snappy job, the Spark tab in the dashboard will show you enough detail about what went on to execute the job. And running in local mode, you can still step through with the debugger to understand the inner workings. – jagsr Oct 25 '17 at 23:02
  • I guess that's the path I'll have to take until that option is available. I was trying to cheat the system by having the best of both Embedded and Smart Connector modes. My best guess at what's happening in the "Lead" system is that it doesn't have an actual Spark master until I snappy-submit a job; then the Lead spawns a Master for the job. Therefore, there's no Master I can connect to in a traditional Spark way unless I spawn my own with local[*], but then it's not embedded with the data servers any more. Is that roughly correct? – Joseph Pride Oct 25 '17 at 23:17
  • @JosephPride What actually happens is that the lead has already spawned a driver and corresponding executors on the store nodes. That is why it is called embedded mode: the data nodes and the executors are the same JVMs (and hence the data is right there). So the job has to run on that already-spawned driver and its executors, which is done either via snappy-job.sh or via JDBC/ODBC (sketched below). Otherwise you need to use smart connector mode, where executors pull data from the store. – Sumedh Nov 01 '17 at 19:21
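
A minimal sketch of the JDBC route mentioned in the last comment, assuming the SnappyData JDBC client driver (io.snappydata.jdbc.ClientDriver) is on the classpath and the locator listens on the default client port 1527; the table name and schema are made up for illustration:

import java.sql.DriverManager

object JdbcEmbeddedSketch {
  def main(args: Array[String]): Unit = {
    // Connect through the locator's client (JDBC) port, default 1527.
    val conn = DriverManager.getConnection("jdbc:snappydata://xxx.xxx.xxx.xxx:1527/")
    try {
      val stmt = conn.createStatement()
      // SQL sent over JDBC runs inside the cluster, i.e. in embedded mode.
      stmt.execute("CREATE TABLE people (name VARCHAR(64), age BIGINT) USING column")
      stmt.execute("INSERT INTO people VALUES ('Andy', 35)")
      val rs = stmt.executeQuery("SELECT name, age FROM people")
      while (rs.next()) println(s"${rs.getString(1)}, ${rs.getLong(2)}")
    } finally {
      conn.close()
    }
  }
}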