
I have created three notebooks with PySpark code in Azure Synapse Analytics. The notebooks run on a Spark pool, and there is only one Spark pool for all three notebooks. When the notebooks are run individually, the Spark pool starts up for each of them by default.

The issue I am facing is that the Spark pool takes 10 minutes to start for each notebook. The pool is assigned 4 vCores and 1 executor. Can somebody please help me understand how to speed up the Spark pool startup in Azure Synapse Analytics?

kshitiz sinha
  • Did you visit the Spark pool's pausing settings and set the number of idle minutes to whatever time you want? (See the note on that setting after these comments.) It is not clear why the Spark pool starts every time for each notebook. – Summer Jun 01 '21 at 23:15
  • Have you gotten a fix for this? I'm also having the same issue. – oneDerer Sep 13 '21 at 09:32
  • Yes. You do not have to split the code into multiple cells unless you need to change the coding language between cells. – kshitiz sinha Sep 21 '21 at 15:24
  • @kshitizsinha So in your notebooks you only have one cell? How much time was reduced after you did that? – oneDerer Oct 06 '21 at 05:54
  • Before merging the code into a single cell it was taking 10-12 minutes. After merging, Spark starts in approximately 2-3 minutes. – kshitiz sinha Oct 08 '21 at 10:53
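
A note on the pausing setting mentioned above: a Synapse Spark pool's auto-pause delay controls how long an idle pool stays alive, so notebooks run back-to-back can attach to an already-running pool instead of paying the cold start each time. You can change it in the pool's settings in the Azure portal, or with the Azure CLI, e.g. `az synapse spark pool update --name mypool --workspace-name myws --resource-group myrg --enable-auto-pause true --delay 60` (the pool, workspace, and resource-group names here are placeholders; `--delay` is the idle time in minutes before the pool pauses).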

2 Answers


I have this problem a lot too. In my experience it takes 4-5 minutes as well.

If it takes longer, make sure you publish (save) your notebook first, then reload the page. Sometimes that refreshes the underlying Livy session.

K.S.

The performance of your Apache Spark pool jobs depends on multiple factors, including:

  • How your data is stored (see the sketch after this list).
  • How the cluster is configured (Small, Medium, or Large).
  • The operations used when processing the data.
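
On the first point, the file format alone can change job cost substantially. Here is a minimal PySpark sketch, assuming hypothetical ADLS paths and column names, that converts CSV to Parquet so later reads benefit from column pruning and predicate pushdown:

```python
# `spark` is the session a Synapse notebook provides; the ADLS paths are hypothetical.
raw = spark.read.csv(
    "abfss://data@mystorage.dfs.core.windows.net/raw/events.csv",
    header=True,
    inferSchema=True,
)

# Convert once to Parquet: columnar, compressed, and schema-carrying.
raw.write.mode("overwrite").parquet(
    "abfss://data@mystorage.dfs.core.windows.net/curated/events"
)

# Later reads only touch the columns and row groups they need.
events = spark.read.parquet("abfss://data@mystorage.dfs.core.windows.net/curated/events")
events.filter(events.event_date == "2020-12-01").select("user_id").distinct().count()
```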

Common challenges you might face include:

  • Memory constraints due to improperly sized executors.
  • Long-running operations.
  • Tasks that result in Cartesian products (see the sketch after this list).
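
On the last point, a Cartesian product arises whenever a join has no key condition, and its output grows as the product of both inputs. A small sketch with hypothetical DataFrames, showing the pattern to avoid and a keyed, broadcast alternative:

```python
from pyspark.sql.functions import broadcast

# Hypothetical data; `spark` is the session a Synapse notebook provides.
orders = spark.createDataFrame(
    [(1, "c1", 9.99), (2, "c2", 5.00), (3, "c1", 12.50)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [("c1", "Alice"), ("c2", "Bob")],
    ["customer_id", "name"],
)

# Anti-pattern: no join key, so every order pairs with every customer.
cartesian = orders.crossJoin(customers)  # 3 * 2 rows here; explodes at scale

# Keyed join instead; broadcasting the small side also avoids shuffling the large one.
joined = orders.join(broadcast(customers), on="customer_id", how="inner")
```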

There are also many optimizations that can help you overcome these challenges, such as caching and mitigating data skew.
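
A short PySpark sketch of both ideas, with a hypothetical `events` table: caching a reused intermediate result, and salting a skewed key before aggregating:

```python
from pyspark.sql import functions as F

# Hypothetical table; `spark` is the session a Synapse notebook provides.
events = spark.read.parquet("abfss://data@mystorage.dfs.core.windows.net/curated/events")

# Caching: materialize a reused intermediate result once instead of recomputing it.
recent = events.filter(F.col("event_date") >= "2020-11-01").cache()
recent.count()  # forces the cache to populate
per_day = recent.groupBy("event_date").count()   # served from the cache
per_user = recent.groupBy("user_id").count()     # served from the cache

# Skew: salt a hot key across N buckets, aggregate partially, then combine.
N = 8
salted = recent.withColumn("salt", (F.rand() * N).cast("int"))
partial = salted.groupBy("user_id", "salt").count()
totals = partial.groupBy("user_id").agg(F.sum("count").alias("count"))
```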

The article *Optimize Apache Spark jobs (preview) in Azure Synapse Analytics* describes common Spark job optimizations and recommendations.

CHEEKATLAPRADEEP
  • Unfortunately this answer does not even take the question into account. You're describing the _performance_ of the cluster, not its initialisation time, which I have personally found abysmally slow (tasks that take 5s to run have to wait over 3 minutes for Spark itself to spin up). – Michał 'PuszekSE' Nov 15 '21 at 14:37