
I'm looking to understand, conceptually, how several Jupyter notebooks running on Spark kernels (such as SparkMagic) can share a cluster of worker nodes.

If User A persists or caches a large RDD (whether on disk or in memory) in a cell, and then goes away for the weekend without stopping the notebook, will this degrade other users' ability to run their jobs while User A's notebook is still running?

That is, all the Spark notebooks sharing the cluster can submit jobs at the same time (they do not have to run sequentially), but the resources will be divided up between them, right?

This is a general question, but in our case we're running in an AWS SageMaker and EMR environment in a US region, in case it makes a difference.


1 Answer


SageMaker notebooks backed by a single EMR cluster connect to the cluster through Livy [1]. Livy on the EMR master node launches a Spark application, and you can find that application in the YARN ResourceManager.
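Roughly, this is what happens behind the notebook kernel (a hedged sketch of the Livy REST flow, not SparkMagic's actual internals; the hostname, session name, and polling interval are placeholders):

```python
import json
import time

import requests

LIVY_URL = "http://<emr-master-node>:8998"  # placeholder; Livy's default port on the EMR master

# One interactive session per notebook kernel.
resp = requests.post(
    f"{LIVY_URL}/sessions",
    data=json.dumps({"kind": "pyspark", "name": "user-a-notebook"}),
    headers={"Content-Type": "application/json"},
)
session = resp.json()

# Wait until Livy has started the underlying Spark application on YARN.
while session["state"] in ("not_started", "starting"):
    time.sleep(5)
    session = requests.get(f"{LIVY_URL}/sessions/{session['id']}").json()

# The appId is the YARN application you will see in the ResourceManager.
print(session["state"], session.get("appId"))
```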

Each notebook opens a separate session, and it is then up to the resource manager to decide which application gets to run, depending on the cluster's resources and which job was submitted first.
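If you want to see what the scheduler is actually juggling, the ResourceManager's REST API lists the running applications and how much memory and how many vCores each one currently holds. A minimal sketch (the host and port 8088 are assumptions about a default EMR setup):

```python
import requests

RM_URL = "http://<emr-master-node>:8088"  # placeholder; YARN ResourceManager web/REST port

# List applications currently holding resources on the cluster.
apps = requests.get(f"{RM_URL}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()

for app in (apps.get("apps") or {}).get("app", []):
    print(
        f"{app['user']:<12} {app['name']:<30} "
        f"queue={app['queue']} mem={app['allocatedMB']}MB vcores={app['allocatedVCores']}"
    )
```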

If you want to control the resources assigned to each user or group, you can configure the YARN scheduler with different queues [2].
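From the notebook side, a session can then be pointed at one of those queues when it is created. The queue name below is a placeholder, and the queue itself still has to be defined in the YARN scheduler configuration first (a hedged sketch):

```python
import json

import requests

LIVY_URL = "http://<emr-master-node>:8998"  # placeholder Livy endpoint

# Ask Livy to submit this session's Spark application into a specific YARN queue.
payload = {
    "kind": "pyspark",
    "queue": "analytics",                       # hypothetical queue name
    "conf": {"spark.yarn.queue": "analytics"},  # equivalent Spark-side setting
}
requests.post(
    f"{LIVY_URL}/sessions",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
```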

Usually Livy kills idle sessions after a configurable timeout [3], so sessions cannot run forever.
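So User A's cached RDD will eventually be released when the idle session times out (the timeout is typically set on the Livy server, e.g. via livy.server.session.timeout), but you don't have to wait for that: the session, and the executors plus cached data it holds, can be released explicitly. A minimal sketch, assuming the placeholder endpoint and session id:

```python
import requests

LIVY_URL = "http://<emr-master-node>:8998"  # placeholder Livy endpoint
session_id = 0                              # hypothetical id of the idle session

# Deleting the session stops its YARN application, freeing executors,
# cached RDDs/DataFrames, and any disk-persisted blocks they held.
requests.delete(f"{LIVY_URL}/sessions/{session_id}")
```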