7

Question is simple:

master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?

The image below shows what I am trying to do; it errors for some reason. Am I missing something here?

[Screenshot of the attempted notebook code: a ThreadPool mapping over notebook names and calling dbutils.notebook.run, which fails.]

John Rotenstein
Chandra

3 Answers

14

Just for others, in case they want to know how it worked:

from multiprocessing.pool import ThreadPool

# Run the child notebooks concurrently on up to 5 worker threads;
# each thread calls dbutils.notebook.run with the notebook path and its arguments.
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
Chandra
  • You can just use `path` - in this case it's easier to move the project into a new folder, etc. If the `path` isn't absolute, it's treated as relative to the current notebook. – Alex Ott Aug 27 '21 at 06:16
  • The limitation with this approach is that you can't share dependencies with the parallel jobs. I hope Databricks can improve this so we can pass more than just strings to the called notebook. – hui chen May 11 '23 at 08:21
  • I will create a level 2 list and run it after the level 1 list has completed. Gives control. – Chandra May 12 '23 at 10:36
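A minimal sketch of that staged approach, reusing the thread pool from the answer above; the level 2 notebook names ('fact_1', 'fact_2') are hypothetical, just to illustrate the idea:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)

def run(path):
    # Same call as in the accepted answer
    return dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path})

# Level 1 notebooks run in parallel; pool.map blocks until all of them have finished
pool.map(run, ['dim_1', 'dim_2'])

# Only then do the (hypothetical) level 2 notebooks start, so dependencies between levels are respected
pool.map(run, ['fact_1', 'fact_2'])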
4

Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.

You need to either modify the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),

or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...). A sketch of the second option is below.
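A minimal sketch of the second option, assuming the child notebooks live under /Test/Threading as in the question:

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
# Keep the paths relative to /Test/ and build the absolute path inside the lambda
notebooks = ['Threading/dim_1', 'Threading/dim_2']
pool.map(
    lambda path: dbutils.notebook.run('/Test/' + path,
                                      timeout_seconds=60,
                                      arguments={"input-data": path}),
    notebooks)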

Alex Ott
0

Databricks now has Workflows (multi-task jobs). Your master_dim task can have other tasks that execute in parallel after it finishes, passing task value parameters to dim_1, dim_2, etc.
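A hedged sketch of the task-values part, assuming a workflow where dim_1 and dim_2 are parallel tasks that depend on a task named master_dim; the task and key names here are made up for illustration:

# In the master_dim task: publish a value for downstream tasks
# (dbutils.jobs.taskValues is available when the notebook runs as a job task).
dbutils.jobs.taskValues.set(key="run_date", value="2023-05-12")

# In dim_1 / dim_2 (configured in the workflow to depend on master_dim and run in parallel):
run_date = dbutils.jobs.taskValues.get(taskKey="master_dim", key="run_date", debugValue="2023-05-12")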

inder